Re: Indexing/Spider problem found and fixed

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jan 26 2003 - 18:58:42 GMT

On Fri, 24 Jan 2003, Smith, Doug wrote:

> I've spent several frustrating hours debugging an index job that uses
> spider.pl, and having found the solution, I thought I'd share it to
> save others the trouble.  I have a site of about 1,000 links, mostly
> HTML and PDF files.  I used the built-in spider.conf and the filter as
> recommended in the docs.  (swish-e 2.2.3, RedHat 8.0 - 2.4.18.)  It
> worked wonderfully on the development server, then failed on the new
> production server (of course).  The spider process failed on several
> of the PDF files, with a message "err: External program failed to
> return required headers Path-Name: & Content-Length:".

That normally means that the Content-Length of the previous "file" sent to
swish was not correct.
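
To illustrate the point about Content-Length, here is a minimal sketch in
Python (not the Perl that spider.pl uses) of the header-plus-body framing
the error message refers to.  It assumes, as I understand the -S prog input
method, that each document is sent as a Path-Name: and a Content-Length:
header, a blank line, and then exactly that many bytes.  The names frame and
read_documents are hypothetical, and the reader is a toy stand-in, not
swish-e's parser.  Whether a character-versus-byte count is what went wrong
under en_US.UTF-8 is only a guess; the sketch just shows how one short
length derails the headers of the next document.

```python
import io


def frame(path_name, text, count_bytes=True):
    """Frame one document: headers, a blank line, then the body."""
    body = text.encode("utf-8")
    # The declared length must be the number of BYTES in the body.
    # Counting characters instead under-reports any multi-byte text.
    length = len(body) if count_bytes else len(text)
    header = f"Path-Name: {path_name}\nContent-Length: {length}\n\n"
    return header.encode("utf-8") + body


def read_documents(stream):
    """Toy reader (hypothetical, not swish-e's parser)."""
    docs = []
    while True:
        line = stream.readline()
        if not line:  # clean end of input
            return docs
        headers = {}
        # Collect header lines up to the blank separator line.
        while line not in (b"\n", b""):
            name, sep, value = line.decode("utf-8", "replace").partition(":")
            if not sep:
                raise ValueError("failed to return required headers")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        if "Path-Name" not in headers or "Content-Length" not in headers:
            raise ValueError("failed to return required headers")
        # Read exactly the declared number of bytes as the body.
        body = stream.read(int(headers["Content-Length"]))
        docs.append((headers["Path-Name"], body))


# "caf\u00e9 menu" is 9 characters but 10 bytes in UTF-8.
good = frame("a.pdf", "caf\u00e9 menu") + frame("b.pdf", "plain text")
bad = frame("a.pdf", "caf\u00e9 menu", count_bytes=False) + frame("b.pdf", "plain text")

print([path for path, _ in read_documents(io.BytesIO(good))])
try:
    read_documents(io.BytesIO(bad))
except ValueError as err:
    print("second stream:", err)
```

In the second stream the reader stops one byte short of the end of the first
body, so the leftover byte is glued onto the front of the next Path-Name:
line and the required header is never seen.  The failure is reported against
the document that follows the one with the bad length.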

> I took one of the offending PDFs and ran it through pdf2html.pm.  
> That failed too, on a "tr / ..." line 201.  After much hunting I
> discovered that the LANG environment variable on the production server
> was "en_US.UTF-8", while the dev server was simply "en_US".  When I
> removed the "UTF-8" from the production box, it worked great!  So, it
> appears that pdf2html.pm wants to do its transliteration in Unicode
> rather than UTF-8, at least, that's my uneducated guess.

So what was happening?  When you say "fail" did Perl give an error
message?

-- 
Bill Moseley moseley@hank.org
Received on Sun Jan 26 18:59:11 2003