It looks like there is a bug in spider.pl. An attempt to set a no_index
attribute on the base_url using the test_url function fails.
I don't want the base_url page indexed, as all the useful info on that
page is included in the articles underneath it. However, if that page
shows article summaries, it will often score higher in search results
than the page with the real info.
I think the bug is in line 378 of process_link. You are resetting the
no_index key to zero there, but test_url has already been called once
while processing the base_url, so the value it set is lost. I added a
test so that the lines:
# Really should just subclass the response object!
$server->{no_contents} = 0;
$server->{no_index} = 0;
$server->{no_spider} = 0;
now read:
# Really should just subclass the response object!
$server->{no_contents} = 0 if $server->{counts}{'Unique URLs'} > 1;
$server->{no_index} = 0 if $server->{counts}{'Unique URLs'} > 1;
$server->{no_spider} = 0 if $server->{counts}{'Unique URLs'} > 1;
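To illustrate the order-of-operations problem, here is a minimal sketch that runs outside spider.pl. The reset_flags and flagged_server subs are hypothetical stand-ins written for this note, not code from the spider; they only model a flag being set by test_url and then reset, with and without the guard on the 'Unique URLs' count.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reset in process_link; not the real spider.pl code.
sub reset_flags {
    my ( $server, $guarded ) = @_;
    for my $flag (qw( no_contents no_index no_spider )) {
        # Guarded form: leave the flags alone while only the base_url has been seen.
        next if $guarded && $server->{counts}{'Unique URLs'} <= 1;
        $server->{$flag} = 0;
    }
    return $server;
}

# Model the state right after test_url has flagged the base_url.
sub flagged_server {
    return { no_index => 1, counts => { 'Unique URLs' => 1 } };
}

my $unguarded = reset_flags( flagged_server(), 0 );
my $guarded   = reset_flags( flagged_server(), 1 );

print "unguarded no_index: $unguarded->{no_index}\n";    # prints 0, flag lost
print "guarded no_index:   $guarded->{no_index}\n";      # prints 1, flag kept
```

Once a second URL has been counted, the guarded form resets the flags just as the original lines do, so later pages still start from zero.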
An example config entry is:
{
    skip     => 0,                      # skip spidering this server
    base_url => 'some url here',
    agent    => 'swish-e spider http://swish-e.org/',
    email    => 'swish@domain.invalid',
    # limit to real articles
    test_url => sub {
        my ( $uri, $server ) = @_;
        # flag the search/index page so that it is not indexed
        $server->{no_index}++ if $uri->path =~ m#/search\.asp#;
        # only article pages and the search page are accepted
        $uri->path =~ m#/article\.asp$# || $uri->path =~ m#/search\.asp$#
    },
    delay_min => .0001,                 # Delay in minutes between requests
    max_time  => 60,                    # Max time to spider in minutes
},
Where the top level page is a search/index page for all pages at a
site.
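For anyone wanting to check the callback on its own, here is a rough sketch that exercises the same logic without running the spider. FakeURI is a hypothetical stand-in that models only the path method of the object the spider passes in, and the sample paths are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the URI object; only path() is modelled.
package FakeURI;
sub new  { my ( $class, $path ) = @_; return bless { path => $path }, $class }
sub path { return $_[0]{path} }

package main;

# Mirrors the logic of the test_url entry above, as a named sub for testing.
sub test_url {
    my ( $uri, $server ) = @_;
    # Flag the search/index page so that it is not indexed.
    $server->{no_index}++ if $uri->path =~ m{/search\.asp};
    # Accept only article pages and the search page.
    return $uri->path =~ m{/(?:article|search)\.asp$} ? 1 : 0;
}

for my $path ( '/search.asp', '/news/article.asp', '/about.html' ) {
    my $server   = {};
    my $accepted = test_url( FakeURI->new($path), $server );
    printf "%-20s accepted=%d no_index=%d\n",
        $path, $accepted, $server->{no_index} // 0;
}
```

With these inputs the search page is accepted and flagged no_index, the article page is accepted without the flag, and anything else is rejected.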
-- rouilj
John Rouillard
===============================================================================
My employers don't acknowledge my existence much less my opinions.
Received on Tue Aug 13 22:27:07 2002