Re: Problem on Parser with TXT/HTML and Spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Apr 29 2003 - 23:14:23 GMT
On Tue Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> 
> I am having a strange problem indexing a combination of MSWord, .txt and PHP
> documents using spider.pl and feeding this into swish-e.  If I index the PHP
> urls first, the documents are parsed and loaded as HTML.  If I select the
> MSWord and other documents, which are filtered by the spider.pl filter
> routines, the MSWord and other documents are parsed as TXT (correctly), then
> when the subsequent PHP and HTML documents are parsed, they are parsed as
> TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> MSWord links, and the URL with only PHP links.

Just to narrow things down: if you save the output from spider.pl to a file, does it contain
the header that sets the parser type?  That is, is spider.pl adding a

   Document-Type:

header?  I think that code is new, so I'm not sure which version you are using.  If it is adding
the header, can you compare the output from the two indexing orders and check whether the
header is set incorrectly in either one?
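
If it helps, here is a rough sketch of a script for checking a saved copy of the spider.pl
output.  It is written in Python only for convenience and is not part of swish-e.  It assumes
the output is a series of documents, each a block of "Name: value" header lines (Path-Name,
Content-Length, and optionally Document-Type), then a blank line, then exactly Content-Length
bytes of content.  The function name scan_prog_output and the example.com URLs are made up
for illustration.

```python
import io


def scan_prog_output(stream):
    """Yield (path_name, document_type) for each document in a binary stream.

    document_type is None when no Document-Type header was written.
    """
    while True:
        line = stream.readline()
        # Tolerate stray blank lines between documents.
        while line in (b"\n", b"\r\n"):
            line = stream.readline()
        if not line:
            return  # end of input

        # Read header lines up to the blank line that ends the header block.
        headers = {}
        while line and line.strip():
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()

        # Skip the document body; Content-Length is assumed to be a byte count.
        stream.read(int(headers.get("content-length", "0")))
        yield headers.get("path-name"), headers.get("document-type")


# Made-up sample standing in for a saved output file.
sample = io.BytesIO(
    b"Path-Name: http://example.com/a.php\n"
    b"Content-Length: 13\n"
    b"Document-Type: HTML*\n"
    b"\n"
    b"<p>hello</p>\n"
    b"Path-Name: http://example.com/b.doc\n"
    b"Content-Length: 6\n"
    b"\n"
    b"hello\n"
)

for path, doc_type in scan_prog_output(sample):
    print(path, doc_type or "(no Document-Type header)")
```

To run it against a real capture, open the saved file in binary mode (for example
open("spider.out", "rb"), where spider.out is a hypothetical file name) and pass that to
scan_prog_output instead of the sample.  Running it once for each indexing order should show
whether the PHP URLs get a different Document-Type in the two cases.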

You can also turn on DEBUG_HEADERS ( debug => DEBUG_HEADERS ) in the config and watch what 
content-type is being returned.

If it's not setting that header, then we need to look at how swish is selecting the parser
(which is based on the file extension, as set by IndexContents and DefaultContents).
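
To make the extension-based fallback concrete, here is a toy model in Python.  It is not
swish-e's actual code, only an illustration of the idea: a set of IndexContents-style rules
maps file suffixes to a parser, and a DefaultContents-style value is used when nothing
matches.  The function pick_parser, the rule table, and the URLs are hypothetical, and the
plain case-sensitive suffix match is an assumption on my part.

```python
def pick_parser(path, index_contents, default_contents=None):
    """Return the parser name for a path using suffix rules.

    index_contents maps a parser name to a list of suffixes.
    default_contents is returned when no suffix matches; None here stands
    for "whatever the indexer's built-in default is".
    """
    for parser, suffixes in index_contents.items():
        # Assumption: a simple case-sensitive "ends with" comparison.
        if any(path.endswith(suffix) for suffix in suffixes):
            return parser
    return default_contents


# Hypothetical rules, loosely mirroring config lines such as
#   IndexContents HTML* .htm .html .php
#   IndexContents TXT*  .txt
#   DefaultContents TXT*
rules = {
    "HTML*": [".htm", ".html", ".php"],
    "TXT*": [".txt"],
}

for url in ("http://example.com/a.php",
            "http://example.com/notes.txt",
            "http://example.com/report.doc"):
    print(url, "->", pick_parser(url, rules, default_contents="TXT*"))
```

In this model a .doc URL falls through to the default, so what the default is set to matters
for any document whose suffix is not listed.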
Received on Tue Apr 29 23:18:07 2003