Hello list,
I am not sure whether this is the right forum for this question. I searched
the swish-e website for a developer list, but could not find one. If I am
posting in the wrong forum, please excuse me!
I am the developer and maintainer of HarvestMan, an open source web crawler
written in Python (http://developer.berlios.de/projects/harvestman).
At a user's request, I have integrated HarvestMan with swish-e, so that
HarvestMan can act as the external program for web crawling via the
"-S prog" option. This work is complete.
Crawling and indexing work well for small crawls, up to a maximum of
about 50-100 files. However, when crawling and indexing sites with
a lot of HTML files, swish-e keeps failing with a "Broken Pipe" error. I am
assuming that swish-e indexes the external program's output by opening
a pipe to the program's STDOUT and reading from it.
The following is a snippet of the error output when indexing the current
Python module documentation at
http://www.python.org/doc/current/modindex.html.
<QUOTE>
anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c examples/swish-config.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./harvestman.py"
External Program found: ./harvestman.py
Warning: Unknown header line: 'me: http://www.python.org/doc/current/lib/module-main.html' from program ./harvestman.py
err: External program failed to return required headers Path-Name:
.
anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in thread fetcher0:
Traceback (most recent call last):
  File "threading.py", line 442, in __bootstrap
    self.run()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 201, in run
    self.action()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 696, in action
    self.process_url()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/methodwrapper.py", line 80, in method
    post(self, x, *args, **kwargs)
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/swish-e.py", line 40, in process_url_further
    sys.stdout.flush()
IOError: [Errno 32] Broken pipe
</QUOTE>
Python is clearly showing that this is a case of a broken pipe. HarvestMan is
a multithreaded program, which means that multiple threads crawl and
download files at the same time and write to STDOUT. I have tried
increasing the interval between writes to STDOUT from different threads, and
have also tried running the program with only 1-2 threads. Large crawls
still fail.
The swish-e support is implemented as a HarvestMan plugin. The current source
code can be viewed and downloaded from the BerliOS CVS:
http://cvs.berlios.de/cgi-bin/viewcvs.cgi/harvestman/HarvestMan-2.0/
I have been able to run HarvestMan with the swish-e plugin when it
just prints the required information to STDOUT without actually calling swish-e.
This works fine without any issues.
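A standalone run like this can also be used to check the output itself, by
redirecting STDOUT to a file and verifying that every record is well formed.
The following is a rough sketch of such a checker for current Python 3. The
function name check_records is hypothetical, and it assumes each record is
"Name: value" header lines (with exactly these header names), a blank line,
and then exactly Content-Length bytes of content.

```python
def check_records(data):
    """Parse captured output (bytes) and return [(path, length), ...].

    Raises ValueError at the first record that is malformed.
    """
    records = []
    pos = 0
    while pos < len(data):
        end = data.find(b"\n\n", pos)
        if end == -1:
            raise ValueError("no blank line after headers at byte %d" % pos)
        headers = {}
        for line in data[pos:end].split(b"\n"):
            name, sep, value = line.partition(b":")
            if not sep:
                raise ValueError("malformed header %r at byte %d" % (line, pos))
            headers[name.strip()] = value.strip()
        for required in (b"Path-Name", b"Content-Length"):
            if required not in headers:
                raise ValueError(
                    "missing %s header at byte %d" % (required.decode(), pos)
                )
        length = int(headers[b"Content-Length"])
        body_start = end + 2
        if body_start + length > len(data):
            raise ValueError("truncated body at byte %d" % body_start)
        path = headers[b"Path-Name"].decode("utf-8", "replace")
        records.append((path, length))
        # The next record must start right after this record's body.
        pos = body_start + length
    return records
```

If the captured file passes with several threads on a large crawl, the
interleaving theory becomes less likely; if it fails, the byte offset in the
error shows where the stream went wrong.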
Can someone let me know where in the swish-e source code I should look
to try to fix this issue? Is there any configuration parameter that
controls the input buffering and piping when reading external program
output?
On a happier note, I have been able to crawl and index smaller sites,
as mentioned.
For example, the swish-e docs URL (http://swish-e.org/docs) indexes without
any issue. The swish-e integration is one of the better features of
this release of HarvestMan (2.0), so it would be nice to get this
annoying bug fixed.
Thanks for your help.
Regards
--
-Anand
Received on Tue May 8 03:04:20 2007