I'm having trouble getting swish to index files once they have been through
spider.pl. Indexing has worked using a practically identical config file
and swishspider.pl. I've tried both the spider.pl from the swish version
I'm working with and the newest version from 2/12/02. I have Perl
5.6.1. Any assistance would be appreciated. The relevant info follows.
Thanks a lot!
Adam Edelman
--------
swish-e 2.1-dev-25 compiled on Jan 19 2002
Trying to index the following document:
<HTML>Sample document</HTML>
I've already tried larger documents and multiple documents with no success.
Config file (SwishSpiderConfig.pl):
@servers = (
    {
        base_url => 'http://arena.internet2.edu/sample.htm',
        email => 'swish@tulane.edu',
        delay_min => .001,
        #max_time => 10, # Max time to spider in minutes
        max_files => 2, # Max Unique URLs to spider
        max_indexed => 1,
        test_url => \&test_url,
        test_response => \&test_response,
        filter_content => \&filter_content,
        debug => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_INFO
    },
);

sub test_url {
    my ( $uri, $server ) = @_;
    return $uri->path =~ /\.html?$/;
}

sub test_response {
    my ( $uri, $server, $response ) = @_;
    return 1; # ok to index and spider
}

sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1;
}

1;
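As a sanity check on the pattern used in test_url, here is a minimal sketch that runs the same regex against a few path strings. It is only an illustration: plain strings stand in for the URI object that spider.pl passes in, the helper name path_passes is made up, and every path other than /sample.htm is hypothetical.

```perl
use strict;
use warnings;

# Same pattern as test_url in the config, applied to a plain path string
# rather than a URI object (an assumption, to keep this standalone).
sub path_passes {
    my ($path) = @_;
    return $path =~ /\.html?$/ ? 1 : 0;
}

# Only /sample.htm comes from the config; the other paths are hypothetical.
for my $path ( '/sample.htm', '/index.html', '/logo.gif', '/page.htmlx' ) {
    printf "%-12s %s\n", $path, path_passes($path) ? 'spider' : 'skip';
}
```

With this pattern, paths ending in .htm or .html pass and anything else is skipped, which agrees with the "Passed all 1 tests for 'test_url'" line in the output.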
Config file (test.txt):
BumpPositionCounterCharacters |.
MaxWordLimit 80
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-é
IndexReport 3
IgnoreTotalWordCountWhenRanking yes
IgnoreWords file: c:\\swish-e\\conf\\stopwords\\english.txt
IndexDir c:\\perl\\bin\\perl.exe
SwishProgParameters c:\\swish-e\\spider.pl
And the output:
c:\swish-e>swish-e -c test.txt -S prog -v 3
Indexing Data Source: "External-Program"
Indexing "c:\perl\bin\perl.exe"
c:\swish-e\spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
-- Starting to spider: http://arena.internet2.edu/sample.htm --
?Testing 'test_url' user supplied function #1
'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1
'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 0 Cnt: 1 http://arena.internet2.edu:80/sample.htm 200 OK text/html 29 parent:
! Found 0 links in http://arena.internet2.edu:80/sample.htm
Path-Name: http://arena.internet2.edu:80/sample.htm
Content-Length: 29
Last-Mtime: 1013569857
<HTML>Sample document</HTML>c:\swish-e\spider.pl: Max indexed files Reached
Summary for: http://arena.internet2.edu/sample.htm
Total Bytes: 29 (29.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
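For comparison, the record that spider.pl hands to swish-e in the output above consists of the Path-Name, Content-Length and Last-Mtime headers followed by the document itself. As far as I understand the -S prog interface, the headers are followed by a blank line and then exactly Content-Length bytes of content. Below is a minimal sketch of a program that emits one such record. The helper name format_doc is made up, and the trailing newline on the content is an assumption to account for the reported 29 bytes. It is not taken from spider.pl.

```perl
use strict;
use warnings;

# Build one document record: header lines, a blank line, then the content.
sub format_doc {
    my ( $path, $mtime, $content ) = @_;
    my $length = length $content;    # byte count, assuming $content holds raw bytes
    return "Path-Name: $path\n"
         . "Content-Length: $length\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $content;
}

# Avoid newline translation on Windows so the bytes written match Content-Length.
binmode STDOUT;

# Values copied from the output above; the trailing "\n" is an assumption.
print format_doc(
    'http://arena.internet2.edu:80/sample.htm',
    1013569857,
    "<HTML>Sample document</HTML>\n",
);
```

Saved under a hypothetical name and given to SwishProgParameters in place of spider.pl, something like this could help narrow down whether the problem lies in the spider output or in how swish-e reads it on this machine.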
Received on Wed Feb 13 06:37:21 2002