
[swish-e] Swish-e and HarvestMan

From: Anand Pillai <abpillai(at)not-real.gmail.com>
Date: Tue May 08 2007 - 07:04:16 GMT
Hello list,

  I am not sure if this is the right forum to post this question. I searched
the swish-e website for a developer list but could not find one. If I am
posting in the wrong place, please excuse me!

 I am the developer and maintainer of HarvestMan, an open source web
crawler written in Python
(http://developer.berlios.de/projects/harvestman). At the request of a
user, I have integrated HarvestMan with swish-e, enabling HarvestMan to
act as an external program for web crawling via the "-S prog" option.
This work is complete.

Crawling and indexing work well for small crawls of up to 50-100 files.
However, when crawling and indexing sites with many HTML files, swish-e
keeps failing with a "Broken pipe" error. My understanding is that
swish-e indexes the external program's output by opening a pipe and
reading the program's STDOUT.
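For reference, this is roughly the document format I am writing to STDOUT for "-S prog": a small header block (Path-Name, Content-Length), a blank line, then the raw body. The header names match what swish-e asks for in the error below; the helper names here are mine, just a sketch:

```python
import sys

def format_document(url: str, body: bytes) -> bytes:
    """Build one swish-e '-S prog' document block: headers, blank line, body."""
    # Content-Length must be the exact byte length of the body, otherwise
    # swish-e reads into the next header block and misparses it.
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (url, len(body))
    return header.encode("ascii") + body

def emit_document(url: str, body: bytes) -> None:
    # Write the whole block in one go, then flush so swish-e sees it.
    sys.stdout.buffer.write(format_document(url, body))
    sys.stdout.buffer.flush()
```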

The following is a snippet of the error when indexing the current Python
module documentation at
http://www.python.org/doc/current/modindex.html.

<QUOTE>
anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c
examples/swish-config.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./harvestman.py"
External Program found: ./harvestman.py

Warning: Unknown header line: 'me:
http://www.python.org/doc/current/lib/module-main.html' from program
./harvestman.py
err: External program failed to return required headers Path-Name:
.
anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in
thread fetcher0:
Traceback (most recent call last):
  File "threading.py", line 442, in __bootstrap
    self.run()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py",
line 201, in run
    self.action()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py",
line 696, in action
    self.process_url()
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/methodwrapper.py",
line 80, in method
    post(self, x, *args, **kwargs)
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/swish-e.py",
line 40, in process_url_further
    sys.stdout.flush()
IOError: [Errno 32] Broken pipe
</QUOTE>

Python's traceback clearly shows this is a case of a broken pipe. HarvestMan
is a multithreaded program, which means that multiple threads are crawling and
downloading files at the same time and writing to STDOUT. I have tried
increasing the delay between two threads' writes to STDOUT, and I have also
tried running the program with only 1-2 threads. Still not much success with
large crawls.
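One thing I plan to try on my side: the garbled warning above ('me: http://...') could be a "Path-Name:" header clipped by another thread's output, so serializing each whole document block behind a lock might help. A sketch (helper names are mine):

```python
import sys
import threading

# One lock shared by all crawler threads that write to STDOUT.
_stdout_lock = threading.Lock()

def emit_document_locked(url: str, body: bytes) -> None:
    # Build the entire block first, then write it under the lock so no
    # other thread's output can land in the middle of our headers.
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (url, len(body))
    block = header.encode("ascii") + body
    with _stdout_lock:
        sys.stdout.buffer.write(block)
        sys.stdout.buffer.flush()
```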

The swish-e support is added as a HarvestMan plugin. The current source
code can be seen and downloaded from berlios CVS.

http://cvs.berlios.de/cgi-bin/viewcvs.cgi/harvestman/HarvestMan-2.0/

I have been able to run HarvestMan with the swish-e plugin when it just
prints the required information to STDOUT without actually piping it to
swish-e. This works without any issues.

Can someone let me know where in the swish-e source code I should look
to try and fix this issue? Is there any configuration parameter that
controls the input buffering and piping when reading external program
output?
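Meanwhile, on the HarvestMan side I could at least handle the failure defensively: once swish-e closes its end of the pipe (e.g. after aborting on the bad header above), any later write gets EPIPE, so the crawler threads should stop writing instead of dying with a traceback. A sketch (the stream parameter and helper name are mine):

```python
import errno
import sys

def safe_write(block: bytes, stream=None) -> bool:
    """Write one document block; return False if the reader has gone away."""
    if stream is None:
        stream = sys.stdout.buffer
    try:
        stream.write(block)
        stream.flush()
        return True
    except OSError as e:
        # errno 32 (EPIPE): swish-e has closed its end of the pipe,
        # e.g. after aborting on a bad header; stop producing output.
        if e.errno == errno.EPIPE:
            return False
        raise
```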

On a happier note, I have been able to crawl and index smaller sites, as
mentioned above. For example, the swish-e docs URL (http://swish-e.org/docs)
indexes without any issue. The swish-e integration is one of the better
features of this release of HarvestMan (2.0), so it would be nice to get
this annoying bug fixed.

Thanks for your help.

Regards
-- 
-Anand
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue May 8 03:04:20 2007