Re: Problems with spider.pl on windows 98 SE

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 13 2002 - 14:08:25 GMT
At 10:35 PM 02/12/02 -0800, Adam Edelman wrote:
>I'm having trouble getting swish to index files once they have been through
>spider.pl.  Indexing has worked using a practically identical config file
>and swishspider.pl.  I've also tried the spider.pl from the swish version
>I'm working with and with the newest version from 2/12/02. I have perl
>5.6.1.  Any assistance would be appreciated.  The relevant info follows.

Thanks very much for such a helpful post.  It makes helping much easier.

So easy, in fact, that it works as-is on my machine.  I just now downloaded
the Feb 7, 2002 binary version onto Win98.

E:\Program Files\SWISH-E>type SwishSpiderConfig.pl
@servers = (
    {
        base_url    => 'http://arena.internet2.edu/sample.htm',
        email       => 'swish@tulane.edu',
        delay_min   => .001,
        #max_time    => 10,        # Max time to spider in minutes
        max_files   => 2,       # Max Unique URLs to spider
        max_indexed => 1,
        test_url        => \&test_url,
        test_response   => \&test_response,
        filter_content  => \&filter_content,
        debug           => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_INFO,
    },
);
sub test_url {
    my ( $uri, $server ) = @_;
    return $uri->path =~ /\.html?$/;
}
sub test_response {
    my ( $uri, $server, $response ) = @_;
    return 1;  # ok to index and spider
}
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1;
}
1;


E:\Program Files\SWISH-E>type test.txt
BumpPositionCounterCharacters |.
MaxWordLimit 80
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_
IndexReport 3
IgnoreTotalWordCountWhenRanking yes
#IgnoreWords file: c:\\swish-e\\conf\\stopwords\\english.txt
IndexDir e:\\perl\\bin\\perl.exe
SwishProgParameters prog-bin/spider.pl


E:\Program Files\SWISH-E>swish-e -S prog -c test.txt
Indexing Data Source: "External-Program"
Indexing "e:\perl\bin\perl.exe"
prog-bin/spider.pl: Reading parameters from 'SwishSpiderConfig.pl'

 -- Starting to spider: http://arena.internet2.edu/sample.htm --
?Testing 'test_url' user supplied function #1 'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1 'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 0 Cnt: 1 http://arena.internet2.edu:80/sample.htm 200 OK text/html 29 parent:
! Found 0 links in http://arena.internet2.edu:80/sample.htm

prog-bin/spider.pl: Max indexed files Reached

Summary for: http://arena.internet2.edu/sample.htm
Total Bytes: 29  (14.5/sec)
 Total Docs:  1  (0.5/sec)
Unique URLs:  1  (0.5/sec)
http://arena.internet2.edu:80/sample.htm - Using DEFAULT (HTML) parser - (2 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 2 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
2 unique words indexed.
4 properties sorted.
1 file indexed.  29 total bytes.  2 total words.
Elapsed time: 00:00:03 CPU time: 00:00:03
Indexing done!

Could there be some issue with Windows 98 SE?

Try running just the spider:

   perl prog-bin/spider.pl > out

then type out:

E:\Program Files\SWISH-E>type out
Path-Name: http://arena.internet2.edu:80/sample.htm
Content-Length: 29
Last-Mtime: 1013569857

<HTML>Sample document</HTML>


That way you can see if spider.pl is working correctly.

You might try adding a blank line in your sample document, just in case
that's causing problems.
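
If you would rather check the "out" file mechanically than by eye, here is a
minimal sketch in Python (not part of swish-e). It assumes only the layout
visible in the output above: header lines, a blank line, then exactly
Content-Length bytes of document. The function name parse_spider_output is
made up for illustration, and skipping stray newlines between documents is a
leniency of this sketch, not something spider.pl is known to emit.

```python
def parse_spider_output(data: bytes):
    """Split captured spider output into (headers, body) records.

    Assumed layout (taken from the 'type out' listing, not from any spec):
    'Name: value' header lines, one blank line, then Content-Length bytes.
    """
    records = []
    pos = 0
    while True:
        # Leniency of this sketch: skip stray newlines before a header block.
        while pos < len(data) and data[pos:pos + 1] in (b"\r", b"\n"):
            pos += 1
        if pos >= len(data):
            break

        headers = {}
        while True:
            end = data.find(b"\n", pos)
            if end == -1:
                raise ValueError("header block not terminated by a blank line")
            line = data[pos:end].rstrip(b"\r")
            pos = end + 1
            if not line:
                break  # blank line: headers are done, body follows
            name, sep, value = line.partition(b":")
            if not sep:
                raise ValueError(f"malformed header line: {line!r}")
            headers[name.strip().decode("ascii")] = value.strip().decode("ascii")

        for required in ("Path-Name", "Content-Length"):
            if required not in headers:
                raise ValueError(f"missing {required} header")

        length = int(headers["Content-Length"])
        body = data[pos:pos + length]
        if len(body) < length:
            raise ValueError(
                f"{headers['Path-Name']}: Content-Length says {length} bytes, "
                f"only {len(body)} present"
            )
        pos += length
        records.append((headers, body))
    return records
```

Read the file in binary mode, for example
parse_spider_output(open("out", "rb").read()), so that no newline translation
happens between the file and the byte count. A ValueError here means the
headers and the document bytes disagree, which is worth ruling out before
looking at the swish-e config.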
-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Feb 13 14:09:09 2002