Re: Problem on Parser with TXT/HTML and Spider.pl

From: <moseley(at)not-real.hank.org>
Date: Wed Apr 30 2003 - 06:58:14 GMT
On Tue, Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> 
> I am having a strange problem indexing a combination of MSWord, .txt and PHP
> documents using spider.pl and feeding this into swish-e.  If I index the PHP
> urls first, the documents are parsed and loaded as HTML.  If I select the
> MSWord and other documents, which are filtered by the spider.pl filter
> routines, the MSWord and other documents are parsed as TXT (correctly), then
> when the subsequent PHP and HTML documents are parsed, they are parsed as
> TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> MSWord links, and the URL with only PHP links.

This is a better fix (I actually tried it this time!)

--- extprog.c.old       2003-04-29 23:51:34.000000000 -0700
+++ extprog.c   2003-04-29 23:52:04.000000000 -0700
@@ -272,7 +272,10 @@
 
             /* Set the doc type from the header */
             if ( docType )
+            {
                 fprop->doctype   = docType;
+                docType = 0;
+            }
 
 
             /* set real_path, doctype, index_no_content, filter, stordesc */


That error doesn't show up on the dev version because the doctype is
set on all files instead of just the filtered ones.
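
For anyone wondering why resetting docType matters, here is a minimal
sketch of the failure mode. It is not the swish-e source: the Doc struct,
the DOC_* constants and assign_doctypes() are hypothetical names made up
for illustration, and the assumption is that docType is a variable that
outlives the handling of a single document, as the patch suggests.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not swish-e's real types. */
enum { DOC_TXT = 1, DOC_HTML = 2 };

typedef struct {
    int header_doctype;   /* 0 means the document sent no doc-type header */
    int default_doctype;  /* type the config would pick for this document */
} Doc;

/*
 * Walk the documents in order and store the type used for docs[i] in out[i].
 * reset_after_use == 0 mimics the old behavior, 1 mimics the patched one.
 */
void assign_doctypes(const Doc *docs, size_t n, int reset_after_use, int *out)
{
    int docType = 0;  /* survives across documents (assumed, per the patch) */

    for (size_t i = 0; i < n; i++) {
        int doctype = docs[i].default_doctype;

        /* A header on this document sets the shared variable. */
        if (docs[i].header_doctype)
            docType = docs[i].header_doctype;

        /* Set the doc type from the header */
        if (docType) {
            doctype = docType;
            if (reset_after_use)
                docType = 0;  /* the fix: do not leak into the next document */
        }

        out[i] = doctype;
    }
}
```

With a filtered document that sends a TXT header followed by a PHP page
that sends none, the unpatched path reports TXT for both, while the
patched path lets the second one fall back to its own default (HTML).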

Sorry for the trouble.
Received on Wed Apr 30 07:05:43 2003