Sometimes this is an indication that you have a multi-byte character
in your content.
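One way a multi-byte character could cause this, assuming the plugin derives Content-Length from the character count of the page rather than from its encoded size: the -S prog reader treats Content-Length as a number of bytes, so a count that does not match what was actually written shifts the point where swish-e expects the next Path-Name header. A minimal Python 3 sketch of building one record with a byte-accurate length follows. The name format_record and the UTF-8 default are illustrative assumptions, not HarvestMan code.

```python
def format_record(path_name: str, content: str, encoding: str = "utf-8") -> bytes:
    """Build one record for swish-e's -S prog input: headers, blank line, body.

    Hypothetical helper for illustration only.
    """
    # Encode first, so the length is measured in bytes, not characters.
    body = content.encode(encoding)
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"
        "\n"
    )
    return header.encode(encoding) + body
```

The record would then be written in binary form, for example with sys.stdout.buffer.write(format_record(url, html)), so that no text-mode re-encoding changes the byte count after it has been computed.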
Bill
On May 8, 2007, at 12:06 AM, Anand Pillai wrote:
> I am using the latest version of swish-e. Here is the version
> information.
>
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -V
> SWISH-E 2.4.4
>
> Running on Ubuntu 6.10 on an Intel dual-core 1.83 GHz machine with 1 GB RAM
>
> -Anand
>
> On 5/8/07, Anand Pillai <abpillai@gmail.com> wrote:
>> Hello list,
>>
>> I am not sure if this is the right forum to post this question. I
>> searched the swish-e website for a developer list, but could not find
>> any. If I am posting in the wrong forum, please excuse!
>>
>> I am the developer and maintainer of an open source web crawler
>> program in Python named HarvestMan
>> (http://developer.berlios.de/projects/harvestman). At the request of
>> a user, I have integrated HarvestMan with swish-e, enabling HarvestMan
>> to work as an external program for web crawling, using the "-S prog"
>> option. This work is complete.
>>
>> The crawling and indexing works well for small crawls of, say, up to
>> a maximum of 50-100 files. However, when crawling and indexing sites
>> with a lot of HTML files, swish-e keeps failing with a "Broken Pipe"
>> error. I am assuming that the way swish-e indexes the external
>> program's output is to open a pipe, read the program's STDOUT and
>> index it.
>>
>> The following is a snippet of the error when indexing the current
>> module documentation of Python at
>> http://www.python.org/doc/current/modindex.html.
>>
>> <QUOTE>
>> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c
>> examples/swish-config.conf -S prog
>> Indexing Data Source: "External-Program"
>> Indexing "./harvestman.py"
>> External Program found: ./harvestman.py
>>
>> Warning: Unknown header line: 'me:
>> http://www.python.org/doc/current/lib/module-main.html' from program
>> ./harvestman.py
>> err: External program failed to return required headers Path-Name:
>> .
>> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in thread fetcher0:
>> Traceback (most recent call last):
>>   File "threading.py", line 442, in __bootstrap
>>     self.run()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 201, in run
>>     self.action()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 696, in action
>>     self.process_url()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/methodwrapper.py", line 80, in method
>>     post(self, x, *args, **kwargs)
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/swish-e.py", line 40, in process_url_further
>>     sys.stdout.flush()
>> IOError: [Errno 32] Broken pipe
>> </QUOTE>
>>
>> Python is clearly showing that this is a case of a broken pipe.
>> HarvestMan is a multithreaded program, which means that multiple
>> threads are crawling and downloading files at the same time and
>> writing to STDOUT. I have tried increasing the time period between
>> two threads writing to STDOUT and also tried to run the program with
>> 1-2 threads. Still not much success with large crawls.
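
If several threads write to STDOUT at the same time, the headers and body of different documents could interleave, which swish-e would likewise see as malformed headers. Spacing the writes out in time does not rule this out, but a lock around each complete record does. The following is a sketch under the assumption that each record is already assembled as bytes. The names emit_record and _write_lock are hypothetical and not part of HarvestMan.

```python
import sys
import threading

# One lock shared by every crawler thread that writes records.
_write_lock = threading.Lock()


def emit_record(record: bytes, stream=None) -> None:
    """Write one complete record while holding the lock.

    Hypothetical helper: headers and body go out in a single write, so
    output from two threads cannot be interleaved mid-record.
    """
    out = stream if stream is not None else sys.stdout.buffer
    with _write_lock:
        out.write(record)
        out.flush()
```

Each fetcher thread would call emit_record once per downloaded document instead of writing headers and content in separate steps.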
>>
>> The swish-e support is added as a HarvestMan plugin. The current
>> source code can be seen and downloaded from berlios CVS.
>>
>> http://cvs.berlios.de/cgi-bin/viewcvs.cgi/harvestman/HarvestMan-2.0/
>>
>> I have been able to run HarvestMan with the swish-e plugin when it
>> just prints the required information to STDOUT without actually
>> calling swish-e.
>> This works fine without any issues.
>>
>> Can someone let me know where in the swish-e source code I should
>> look to try to fix this issue? Is there any configuration parameter
>> that controls the input buffer and piping when reading external
>> program output?
>>
>> On a happier note, I have been able to crawl and index smaller sites
>> as mentioned. For example, the swish-e docs URL
>> (http://swish-e.org/docs) indexes without any issue. The swish-e
>> integration is one of the better features of this release of
>> HarvestMan (2.0), so it would be nice if this annoying bug were fixed.
>>
>> Thanks for your help.
>>
>> Regards
>> --
>> -Anand
>>
>
>
> --
> -Anand
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
Received on Tue May 8 03:24:45 2007