spider.pl does not follow Metadata redirects

From: Robert Keith <Robert(at)not-real.technolords.com>
Date: Mon Jul 28 2003 - 08:43:03 GMT
I need to spider a web site whose main index.html contains a meta refresh
tag that redirects the browser to index.php.

Example:
www.intellivence.com/index.html ->
    <meta http-equiv="refresh" content="0; URL=http://www.intellivence.com/index.php">
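
To make the expected behavior concrete, here is a rough Python sketch of what
following such a tag involves: find the meta refresh tag, take the URL part of
its content attribute, and resolve it against the page's own URL. This is not
spider.pl code and says nothing about how spider.pl works internally; the
names MetaRefreshParser and meta_refresh_target are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Hypothetical helper: remember the target of the first meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; URL=http://host/path": delay first, then URL.
        match = re.match(
            r"\s*\d+\s*[;,]\s*url\s*=\s*(.+)",
            attr.get("content", ""),
            re.IGNORECASE | re.DOTALL,
        )
        if match:
            self.target = match.group(1).strip().strip("'\"")


def meta_refresh_target(html, base_url):
    """Return the absolute redirect URL, or None if the page has no meta refresh."""
    parser = MetaRefreshParser()
    parser.feed(html)
    parser.close()
    if parser.target is None:
        return None
    # Resolve relative targets such as "index.php" against the page URL.
    return urljoin(base_url, parser.target)


page = ('<html><head><meta http-equiv="refresh" '
        'content="0; URL=http://www.intellivence.com/index.php"></head></html>')
print(meta_refresh_target(page, "http://www.intellivence.com/index.html"))
```

A spider that wanted to act like a browser here would queue the returned URL
for fetching instead of indexing the nearly empty redirect page.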

This works fine in browsers (albeit slowly).
I know there are better ways to do this (configuring the web server to send
the correct headers, etc.), but we can't control foreign sites. Shouldn't
the spider behave the way browsers do?


Is this the current behavior or did I miss something?

Regards,
Robert Keith

==============================================

The command I run is:

        /usr/bin/perl /fs/area/search/prog-bin/spider.pl \
            /fs/area/search/conf/prof2.pl \
          | swish-e -S prog -c /fs/area/search/conf/prof2 -i stdin -v3

The output is:

Parsing config file '/fs/area/search/conf/prof2'
Parsing config file '/fs/area/search/conf/common.config'
Indexing Data Source: "External-Program"
Indexing "stdin"
/fs/area/search/prog-bin/spider.pl: Reading parameters from '/fs/area/search/conf/prof2.pl'

 -- Starting to spider: http://www.intellivence.com/ --
>> +Fetched 0 Cnt: 1 http://www.intellivence.com/ 200 OK text/html 130 parent:

Summary for: http://www.intellivence.com/
Total Bytes: 130  (130.0/sec)
 Total Docs:   1  (1.0/sec)
Unique URLs:   1  (1.0/sec)
http://www.intellivence.com/ - Using HTML parser -  (no words indexed)

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
Received on Mon Jul 28 08:43:56 2003