> By the way,
>
> $URI::ABS_REMOTE_LEADING_DOTS = 1;
>
> at the top of my spider config file does seem to fix it.
I added that at the top, and I still get ".." in some URLs, but not as
many as before. I ran the site through the W3C link checker, and there are
broken links all over the place on that site.
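For anyone who wants to see what that setting changes, here is a minimal sketch outside the spider. It assumes the URI module is installed (the spider already needs it); the resolve() helper, the base URL, and the relative link are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $rel against $base with the
# leading-dots setting switched on or off for this call only.
sub resolve {
    my ($rel, $base, $strip_dots) = @_;
    local $URI::ABS_REMOTE_LEADING_DOTS = $strip_dots;
    return URI->new_abs($rel, $base)->as_string;
}

# Made-up example values: the link climbs one level above the site root.
my $base = 'http://www.example.invalid/dir/page.html';
my $rel  = '../../other.html';

print "off: ", resolve($rel, $base, 0), "\n";
print "on:  ", resolve($rel, $base, 1), "\n";
```

Links that climb above the site root are the case this setting targets, so a page with fewer such links would show fewer ".." entries.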
I spoke with the person in charge of the site, and she wants a separate
search page for each directory. So it looks like I'll need a separate
swish index for a few of the directories. Also, she didn't seem willing
to add a metatag to every page, so it looks like swish-e will have to do
all the work.
Here's the main chunk of my SwishSpider.pl:
@servers = (
    {
        skip        => 0,          # skip spidering this server
        debug       => DEBUG_URL,  # print some debugging info to STDERR
        base_url    => 'http://www.oshkoshpubliclibrary.org/citydirs/',
        email       => 'swish@domain.invalid',
        delay_min   => .0001,
        link_tags   => [qw/ a frame /],
        max_files   => 2500,
        max_indexed => 2500,       # max number of files to send to swish for indexing
        max_size    => 100_000_000_000,  # file size limit in bytes (about 100 GB as set)
        max_depth   => 10,         # spider only ten levels deep
        keep_alive  => 1,

        # Skip images, and only follow URLs under /citydirs/.
        test_url => sub {
            my $uri = shift;
            return if $uri->path =~ /\.(gif|jpeg)$/;
            return $uri->path =~ m[^/citydirs/];
        },

        # Only accept the content types listed below.
        test_response => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type }
                qw{ text/html text/plain application/pdf application/msword };
            # This might be used if you only wanted to index PDF files,
            # yet still spider.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';
            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content => [ \&pdf, \&doc ],
    },
);
As it is now, it indexes all of www.oshkoshpubliclibrary.org, even though
I've got /citydirs/ in the test_url part. For the base_url, should that stop
at .org, or should it also contain /citydirs/? And am I missing something
else, or do I have a typo? Either way I do it, it still indexes everything
at the site.
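One way to rule out the test_url logic itself is to run the same callback against a few URLs outside the spider. This is only a sketch: it assumes the URI module is installed, and the sample URLs are hypothetical, picked to cover one page under /citydirs/, one image, and one page elsewhere on the site.

```perl
use strict;
use warnings;
use URI;

# Same logic as the test_url callback in the config above.
my $test_url = sub {
    my $uri = shift;
    return if $uri->path =~ /\.(gif|jpeg)$/;
    return $uri->path =~ m[^/citydirs/];
};

# Hypothetical sample URLs, chosen only to exercise the callback.
my @samples = (
    'http://www.oshkoshpubliclibrary.org/citydirs/index.html',
    'http://www.oshkoshpubliclibrary.org/citydirs/photo.gif',
    'http://www.oshkoshpubliclibrary.org/index.html',
);

for my $url (@samples) {
    my $uri = URI->new($url);
    printf "%-4s %s\n", ( $test_url->($uri) ? 'keep' : 'skip' ), $url;
}
```

If the callback rejects URLs outside /citydirs/ here but the spider still fetches them, the problem may be somewhere other than the regex, for example in which config file the spider is actually loading.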
Jody
Received on Tue Apr 15 19:07:05 2003