Problems with spider.pl on Windows 98 SE

From: Adam Edelman <aedelma(at)not-real.tulane.edu>
Date: Wed Feb 13 2002 - 06:36:47 GMT
I'm having trouble getting swish-e to index files once they have been
through spider.pl.  Indexing has worked using a practically identical config
file and swishspider.pl.  I've tried both the spider.pl that came with the
swish-e version I'm working with and the newest version from 2/12/02.  I have
Perl 5.6.1.  Any assistance would be appreciated.  The relevant info follows.
Thanks a lot!

Adam Edelman
--------
swish-e 2.1-dev-25 compiled on Jan 19 2002

Trying to index the following document:
<HTML>Sample document</HTML>
I've already tried larger documents and multiple documents with no success.

Config file (SwishSpiderConfig.pl):
@servers = (
    {
        base_url    => 'http://arena.internet2.edu/sample.htm',
        email       => 'swish@tulane.edu',
        delay_min   => .001,
        #max_time    => 10,        # Max time to spider in minutes
        max_files   => 2,       # Max Unique URLs to spider
        max_indexed => 1,
        test_url        => \&test_url,
        test_response   => \&test_response,
        filter_content  => \&filter_content,
        debug           => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_INFO,
    },
);
sub test_url {
    my ( $uri, $server ) = @_;
    return $uri->path =~ /\.html?$/;
}
sub test_response {
    my ( $uri, $server, $response ) = @_;
    return 1;  # ok to index and spider
}
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1;
}
1;

Config file (test.txt):
BumpPositionCounterCharacters |.
MaxWordLimit 80
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-
IndexReport 3
IgnoreTotalWordCountWhenRanking yes
IgnoreWords file: c:\\swish-e\\conf\\stopwords\\english.txt
IndexDir c:\\perl\\bin\\perl.exe
SwishProgParameters c:\\swish-e\\spider.pl

And the output:
c:\swish-e>swish-e -c test.txt -S prog -v 3
Indexing Data Source: "External-Program"
Indexing "c:\perl\bin\perl.exe"
c:\swish-e\spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
-- Starting to spider: http://arena.internet2.edu/sample.htm --
?Testing 'test_url' user supplied function #1
'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1
'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 0 Cnt: 1 http://arena.internet2.edu:80/sample.htm 200 OK
text/html 29 parent:
! Found 0 links in http://arena.internet2.edu:80/sample.htm
Path-Name: http://arena.internet2.edu:80/sample.htm
Content-Length: 29
Last-Mtime: 1013569857
<HTML>Sample document</HTML>c:\swish-e\spider.pl: Max indexed files Reached
Summary for: http://arena.internet2.edu/sample.htm
Total Bytes: 29 (29.0/sec)
Total Docs:   1 (1.0/sec)
Unique URLs:   1 (1.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
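
For reference, the output above shows the record that spider.pl hands to
swish-e in -S prog mode: header lines (Path-Name, Content-Length, Last-Mtime)
followed by the document body.  The sketch below is only an illustration of
that layout, not the spider.pl code.  The sub name format_prog_document and
the example.com URL are hypothetical, and the blank line between headers and
body is my understanding of the prog input format.  The binmode call is an
assumption about Windows text-mode output (where "\n" is written as "\r\n");
whether it has anything to do with the error above is not established here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one document record: headers, a blank line, then the body.
# Content-Length is taken from length(), so $content is assumed to hold
# raw bytes rather than decoded characters.
sub format_prog_document {
    my ( $path, $mtime, $content ) = @_;
    my $length = length $content;
    return "Path-Name: $path\n"
         . "Content-Length: $length\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $content;
}

# Run the demo only when executed directly, not when loaded from other code.
unless (caller) {
    # Assumption: keep Windows from translating "\n" to "\r\n" on output,
    # which would make the bytes written differ from Content-Length.
    binmode STDOUT;
    print format_prog_document(
        'http://example.com/hello.html',       # hypothetical URL
        1013569857,                            # mtime as seconds since the epoch
        '<html><body>hello</body></html>',
    );
}
```

Saving this as a script and naming it in IndexDir, in place of perl.exe plus
spider.pl, would be one way to check whether swish-e can index a record that
does not come from the spider at all.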
Received on Wed Feb 13 06:37:21 2002