spider bug?

From: Mark Morgan <mark(at)not-real.zaneray.com>
Date: Tue Oct 05 2004 - 16:20:56 GMT
I'm trying to index a client site, www.e-caps.com, with Swish-e.  I'm using
2.5.2 and have also tried 2.4.2, with the same results.  Most pages index
fine, but one is confusing spider.pl.  I get:

Parsing config file 'e-caps.conf'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
http://www.e-caps.com/za/ECP?PAGE=ABOUT_US - Using HTML2 parser -  (470 words)
http://www.e-caps.com/za/ECP?PAGE=HOME - Using HTML2 parser -  (409 words)
http://www.e-caps.com/za/ECP?PAGE=PRODUCTS_MAIN - Using HTML2 parser -  (140 words)
http://www.e-caps.com/za/ECP?PAGE=KNOWLEDGE - Using HTML2 parser -  (387 words)

Warning: Unknown header line: 'tml>Path-Name: http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093' from program spider.pl
err: External program failed to return required headers Path-Name:


The knowledge page passes HTML validation, at least structurally, yet for
some reason it leaves the spider with an extraneous 'tml>' string stuck on
the front of the next Path-Name header.
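
One unverified guess at how a stray 'tml>' could arise: with -S prog, each
document is preceded by headers such as Path-Name and Content-Length, and
the reader consumes exactly Content-Length bytes of body.  If that length
were ever four less than the real byte count (for example, if it counted
characters while the page contained four two-byte UTF-8 characters), the
last four bytes of '</html>' would be left over and glued onto the next
Path-Name line.  The Python sketch below only simulates that idea.  The
names write_record and read_records are hypothetical helpers, not part of
Swish-e or spider.pl, and nothing here confirms that this is what actually
happens on this site.

```python
import io

# Only the two headers needed for the illustration are modeled here.
KNOWN_HEADERS = ("Path-Name", "Content-Length")


def write_record(path, body_text, use_byte_length):
    """Build one header-plus-body record (hypothetical helper)."""
    body = body_text.encode("utf-8")
    # The suspected bug: counting characters instead of encoded bytes.
    length = len(body) if use_byte_length else len(body_text)
    header = f"Path-Name: {path}\nContent-Length: {length}\n\n"
    return header.encode("ascii") + body


def read_records(stream):
    """Yield (path, body) pairs; raise ValueError on an unknown header."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input
        headers = {}
        while line not in (b"\n", b""):
            text = line.decode("utf-8", errors="replace").rstrip("\n")
            name, sep, value = text.partition(":")
            if not sep or name not in KNOWN_HEADERS:
                raise ValueError(f"Unknown header line: '{text}'")
            headers[name] = value.strip()
            line = stream.readline()
        # Consume exactly as many body bytes as the header promised.
        body = stream.read(int(headers["Content-Length"]))
        yield headers["Path-Name"], body


# Four two-byte characters make the byte count exceed the character count by 4.
page = "<html><body>\u00e9\u00e9\u00e9\u00e9</body></html>"

good = write_record("page1", page, True) + write_record("page2", page, True)
print([path for path, _ in read_records(io.BytesIO(good))])

bad = write_record("page1", page, False) + write_record("page2", page, False)
try:
    list(read_records(io.BytesIO(bad)))
except ValueError as err:
    print(err)
```

With byte-accurate lengths both records parse.  With character counts the
second record's first header line arrives as 'tml>Path-Name: page2', which
has the same shape as the warning above.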

My config is:

    # Configuration file for spidering the e-caps site
    # Use the "spider.pl" program included with Swish-e
    IndexDir spider.pl

    # Define what site to index
    SwishProgParameters default http://www.e-caps.com/za/ECP?PAGE=ABOUT_US

and the command is:

    swish-e -S prog -c e-caps.conf -v9

Other pages on the site index fine, as the first few lines of output show,
but for some reason the knowledge page makes it blow chunks.  Anyone have
any ideas?  If I run with -S http, it works, but I need to use -S prog
because we have a bunch of PDF files that we want to index.


|
|  Mark Morgan
|  Senior Programmer/Analyst
|  T H E   Z A N E R A Y   G R O U P ,  I N C .
|
|  mark@zaneray.com
|
|  25 O'Brien Avenue
|  Whitefish, MT 59937
|  406.863.8000
|
|  http://www.zaneray.com
|
Received on Tue Oct 5 09:21:16 2004