Re: "External program failed to return required headers"

From: Aaron Bazar <aaronb(at)not-real.spamcop.net>
Date: Wed Sep 03 2003 - 21:58:27 GMT
Actually, I have it as "en_US" ...  sorry for the typo.


Best regards,

Aaron Bazar
http://www.petsuppliessearch.com






I had the same issues. The only way I could get it to work was to
change the default language (the LANG setting) in RedHat. This can be
done in your own (spider.pl) environment or for the entire server.
Change it to US_en.

Best regards,

Aaron Bazar




-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Bill Moseley
Sent: Wednesday, September 03, 2003 5:15 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: "External program failed to return required
headers"


On Wed, Sep 03, 2003 at 01:02:53PM -0700, Thomas Dowling wrote:

> I am trying to use SWISH-E (I've tried both 2.2.3 and 2.4.0  pr1)  to 
> spider our website.  Following directions in the documentation, I set up 
> a basic swish.conf and spider.conf, and my indexing run always bombs 
> with the message:
> 
> err: External program failed to return required headers Path-Name: & 
> Content-Length:
> 
> I found what appeared to be an identical problem report in the list 
> archives from last April (<http://swish-e.org/archive/5149.html>), but 
> didn't see a definitive solution posted there.  None of the suggestions 
> offered there affect the problem here.

That error message typically means the length was set wrong on the
*previous* document, so when swish-e goes to read the next document's
headers it is reading from the wrong place in the stream.

> I took the liberty of inserting a line into spider.pl to print out the 
> headers, and every document it reports on does have Path-Name and 
> Content-Length headers, which makes me suspect the problem is either 
> with swish-e itself or in the interaction between spider.pl and swish-e.

I often do things the hard way.  For example, I've saved the output 
from spider.pl to a file, then extracted each document one by one and 
verified that its content-length is indeed its byte length.
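
A rough sketch of that check, in case it saves someone the manual work. It assumes the captured stream is a series of records, each made of header lines, a blank line, and then exactly Content-Length bytes of content. The name check_stream and the usage line are made up for this example and are not part of spider.pl or swish-e.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Walk a captured spider.pl stream and check every record's Content-Length.
# Returns a list of problem descriptions; an empty list means the stream
# parsed cleanly from start to finish.
sub check_stream {
    my ($fh) = @_;
    binmode $fh;    # count raw bytes, not characters
    my @problems;
    my $doc = 0;

    while (1) {
        my %header;
        my $saw_line = 0;

        # Header lines run up to the first blank line.
        while ( defined( my $line = <$fh> ) ) {
            $saw_line = 1;
            $line =~ s/\r?\n\z//;
            last if $line eq '';
            if ( $line =~ /^([\w-]+):\s*(.*?)\s*$/ ) {
                $header{ lc $1 } = $2;
            }
            else {
                push @problems, "doc $doc: expected a header line";
                return @problems;
            }
        }
        last unless $saw_line;    # clean end of stream
        $doc++;

        for my $name ( 'path-name', 'content-length' ) {
            next if defined $header{$name};
            push @problems, "doc $doc: missing $name header";
        }
        return @problems if @problems;

        my $want = $header{'content-length'};
        if ( $want !~ /^\d+$/ ) {
            push @problems, "doc $doc: Content-Length is not a number";
            return @problems;
        }

        # Consume exactly the advertised number of bytes.
        my $got = read( $fh, my $buf, $want );
        if ( !defined $got || $got != $want ) {
            push @problems, "doc $doc ($header{'path-name'}): short read";
            return @problems;
        }
    }
    return @problems;
}

# Hypothetical usage: perl check_stream.pl spider-output.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @problems = check_stream($fh);
    print @problems
        ? join( "\n", @problems ) . "\n"
        : "all records consistent\n";
}
```

Note that a wrong length shows up as a complaint about the document *after* the bad one, so the record to inspect is the one just before the reported number.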

The problem (which may depend on the version of Perl and the LANG 
setting) is that spider.pl uses length() to set the Content-Length 
header, but for multi-byte chars (which swish-e won't support) 
length() counts characters, so it can differ from the size of the data 
in bytes.  So I have also edited spider.pl so that, where it grabs the 
length(), it writes the content out to a file on disk, and then I 
stat'ed the file to check whether the length is the same as the file 
size.
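
Something along these lines would do the disk comparison. It is only a sketch: length_vs_file_size is a made-up helper, not code from spider.pl, and it assumes the content goes out as UTF-8, which may not match what your setup actually writes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $content to a temporary file and return both the value length()
# reports and the file's size in bytes, so the two can be compared.
sub length_vs_file_size {
    my ($content) = @_;
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    binmode $fh, ':utf8';    # assumption: the output encoding is UTF-8
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    return ( length($content), -s $path );
}

my ( $chars, $bytes ) = length_vs_file_size( chr(400) . "abc" );
print "length() = $chars, file size = $bytes\n";
```

If the two numbers differ for a document, its Content-Length header would be wrong for the same reason.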

> I've tried this against multiple web sites.  The number of files scanned 
> before the indexing run dies varies from site to site, but is consistent 
> on each site.  FWIW, I'm running swish-e under RedHat 8.0 with Perl 
> 5.8.0 (and, if I'm reading things correctly, LWP 5.65).

I think it was RedHat 9 where the default LANG is UTF-8.  There have 
been problems reported in this case.  I'm not sure if it applies to RH 
8.0.

Assuming that this is a multi-byte character problem:

There is some code in spider.pl's output_content() function 
that was supposed to fix this:

    # ugly and maybe expensive, but perhaps more portable than "use bytes"
    my $bytecount = length pack 'C0a*', $$content;

$ perl -le '$x=chr(400); print length pack "C0a*", $x'
2

Here it is without and then with the "use bytes;" pragma:

$ perl -le '$x=chr(400); print length $x'
1

$ perl -le '$x=chr(400); use bytes; print length $x'
2
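
Another way to get a byte count, shown only as a sketch and not as what spider.pl does: encode the string first and measure the result. The byte_length helper is made up for this example, and it assumes $text is a decoded character string that will be written out as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Number of bytes $text occupies once encoded as UTF-8.
# Assumes $text holds characters, not already-encoded bytes.
sub byte_length {
    my ($text) = @_;
    return length( encode_utf8($text) );
}

my $x = chr(400);
print "characters: ", length($x), "\n";
print "bytes:      ", byte_length($x), "\n";
```

For chr(400) this reports 1 character and 2 bytes, the same pair of numbers as the one-liners above.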



-- 
Bill Moseley
moseley@hank.org
Received on Wed Sep 3 21:58:37 2003