Hi,
We had the same problem on RH9. The solution was to do this:
LANG=en_GB
export LANG
in a shell and then run it again. That worked fine for me (but then I'm
in England).
The discussion was here: http://swish-e.org/archive/4870.html
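
If you would rather not change LANG for the whole shell session, something like the following might do. It is only a sketch: run_in_locale is a made-up helper name, not anything shipped with swish-e, and en_GB is just the locale that happened to work here.

```shell
#!/bin/sh
# run_in_locale is a hypothetical helper name, not part of swish-e.
# Usage: run_in_locale LOCALE COMMAND [ARGS...]
# Runs COMMAND with LANG set to LOCALE for that one process only, so the
# LANG of the calling shell is left untouched.
run_in_locale() {
    locale_name=$1
    shift
    # env hands the modified LANG to the command and to nothing else.
    # Note: if LC_ALL is set in the environment, it takes precedence over LANG.
    env LANG="$locale_name" "$@"
}
```

Call it as run_in_locale en_GB followed by whatever indexing command you normally run.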
Cheers
John
Bill Moseley wrote:
>On Wed, Sep 03, 2003 at 01:02:53PM -0700, Thomas Dowling wrote:
>
>
>
>>I am trying to use SWISH-E (I've tried both 2.2.3 and 2.4.0 pr1) to
>>spider our website. Following directions in the documentation, I set up
>>a basic swish.conf and spider.conf, and my indexing run always bombs
>>with the message:
>>
>>err: External program failed to return required headers Path-Name: &
>>Content-Length:
>>
>>I found what appeared to be an identical problem report in the list
>>archives from last April (<http://swish-e.org/archive/5149.html>), but
>>didn't see a definitive solution posted there. None of the suggestions
>>offered there affect the problem here.
>>
>>
>
>That error message typically means the length was set wrong on the
>*previous* document, so when swish-e tries to read the next document
>it's reading from the wrong place in the stream.
>
>
>
>>I took the liberty of inserting a line into spider.pl to print out the
>>headers, and every document it reports on does have Path-Name and
>>Content-Length headers, which makes me suspect the problem is either
>>with swish-e itself or in the interaction between spider.pl and swish-e.
>>
>>
>
>I often do things the hard way. For example, I've saved the output from
>spider.pl to a file, then extracted each document one by one and
>verified that its content-length is indeed its byte length.
>
>The problem (possibly depending on the version of Perl and the LANG
>setting) is that spider.pl uses length() to set the content-length
>header, but for multi-byte chars (which swish-e won't support)
>length() and the size of the data can be two different things. So I
>have also edited spider.pl, and where it grabs the length() I have
>written the file out to disk and then stat'ed it to check whether the
>length is the same as the file size.
>
>
>
>>I've tried this against multiple web sites. The number of files scanned
>>before the indexing run dies varies from site to site, but is consistent
>>on each site. FWIW, I'm running swish-e under RedHat 8.0 with Perl
>>5.8.0 (and, if I'm reading things correctly, LWP 5.65).
>>
>>
>
>I think it was RedHat 9 where the default LANG is UTF-8. There have
>been problems reported in this case. I'm not sure if it applies to RH
>8.0.
>
>Assuming that this is a multi-byte character problem:
>
>There is some code in spider.pl's output_content() function
>that was supposed to fix this:
>
> # ugly and maybe expensive, but perhaps more portable than "use bytes"
> my $bytecount = length pack 'C0a*', $$content;
>
>$ perl -le '$x=chr(400); print length pack "C0a*", $x'
>2
>
>Here's the same thing without and then with the "use bytes;" pragma:
>
>$ perl -le '$x=chr(400); print length $x'
>1
>
>$ perl -le '$x=chr(400); use bytes; print length $x'
>2
>
>
>
>
>
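
For anyone who wants to repeat the manual check Bill describes above (save the spider.pl output, pull out one document, compare its Content-Length with the real byte count), here is a rough sketch in plain shell. check_doc is a hypothetical helper name. It assumes the file holds exactly one document laid out as header lines, one blank line, then the body. It does not split a multi-document stream, so you would have to cut out each document first.

```shell
#!/bin/sh
# check_doc is a hypothetical helper, not part of swish-e or spider.pl.
# Usage: check_doc FILE
# Assumption: FILE holds exactly one document: header lines, one blank
# line, then the body.
check_doc() {
    file=$1
    # Value of the Content-Length header; stop reading at the blank line.
    declared=$(LC_ALL=C sed -n -e 's/^Content-Length: *\([0-9][0-9]*\).*/\1/p' -e '/^$/q' "$file" | head -n 1)
    # Bytes taken up by the headers, including the blank separator line.
    header_bytes=$(LC_ALL=C sed -e '/^$/q' "$file" | wc -c | tr -d ' ')
    # Total size of the file in bytes (not characters).
    total_bytes=$(wc -c < "$file" | tr -d ' ')
    actual=$((total_bytes - header_bytes))
    if [ "$declared" = "$actual" ]; then
        echo "ok: $actual bytes"
    else
        echo "mismatch: header says ${declared:-nothing}, body is $actual bytes"
        return 1
    fi
}
```

A body consisting of the single character chr(400) from the examples above is two bytes in UTF-8, so a header of Content-Length: 1 on it would be reported as a mismatch.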
Received on Wed Sep 3 21:27:43 2003