I've made some minor changes to "spider.pl" that I've found very
useful, so I thought I'd pass them on. I converted "spider.pl" to a
Perl module, turning the in-line code into a subroutine and adding a
simple "sub init" that clears the variables for each pass. The script
can then be used essentially "as-is" by calling

    use lib qw(./);
    use Swish::Spider;
    Swish::Spider::init;
    Swish::Spider::run_spider;

from within a small script.
The advantages come when you write your own "steering" routine to
replace "run_spider".
I run this from a Perl script that is similar to the original in-line
code but includes more logic: it examines the last-modified date of
the previous index, modifies the criteria for "test_url" for each site
that is indexed, and optionally uses an individual config file for
each site when it is spidered. This lets me pick up new versions of
"spider.pl" with minimal changes as they are released, yet preserve
the code written for our spidering script.
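
For anyone wanting to try the "last modified date of the previous index" idea, here is a minimal sketch of one way it could be done. It is not taken from our scripts: the names index_mtime and changed_since_index are hypothetical, and it assumes the page's Last-Modified time is already available as epoch seconds.

```perl
use strict;
use warnings;

# Hypothetical helper: modification time (epoch seconds) of a previous
# index file, or undef when no such file exists yet.
sub index_mtime {
    my ($index_file) = @_;
    return unless -e $index_file;
    return (stat $index_file)[9];
}

# Hypothetical helper: true when a page (Last-Modified given as epoch
# seconds) is newer than the previous index, or when there is no
# previous index to compare against.
sub changed_since_index {
    my ($page_mtime, $index_file) = @_;
    my $indexed = index_mtime($index_file);
    return 1 unless defined $indexed;
    return $page_mtime > $indexed ? 1 : 0;
}
```

A check like this could be called from a site-specific "test_url" or from the steering script before a site is spidered at all.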
I've included some examples from our scripts.
The spider.pl doc says:

    swish-e -S prog -c swish.config
    perl spider.pl | swish-e -S prog -c swish.conf -i stdin

Our scripts do:

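    # somewhere else....
    foreach ($indx) {
        my $cf = $swishlib . '/' . $prefix . '.config';
        # check for an individual config file, else fall back to the default
        $cf = $swishlib . '/' . $swconfig unless -e $cf;
        my $i = qq($swish $verbose -S prog -c $cf -i stdin -f $indx);
        # save STDOUT, then pipe everything the spider prints into swish-e
        local *SAVOUT;
        open SAVOUT, ">&STDOUT" or die "can't dup STDOUT: $!";
        open STDOUT, "|$i" or die "can't run $swish: $!";
        foreach (@urls) {
            $servers{base_url} = $_;
            &Swish::Spider::process_server( $s );
        }
        close STDOUT;
        # restore the original STDOUT
        open STDOUT, ">&SAVOUT" or die "can't restore STDOUT: $!";
    }
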
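
The save/redirect/restore handling of STDOUT above could also be wrapped in a small helper, so a failure inside the spider cannot leave STDOUT pointing at the pipe. This is only a sketch: with_stdout_piped_to is a hypothetical name, not something in spider.pl, and it assumes the command is run through the shell as in the excerpt.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with STDOUT piped into the shell
# command $cmd, then put the original STDOUT back.
# Returns true when the command exits with status 0.
sub with_stdout_piped_to {
    my ($cmd, $code) = @_;

    open my $saved, '>&', \*STDOUT or die "can't dup STDOUT: $!";
    open STDOUT, '|-', $cmd        or die "can't start '$cmd': $!";

    # Trap errors so STDOUT is restored even if $code dies.
    my $ok  = eval { $code->(); 1 };
    my $err = $@;

    close STDOUT;                  # flushes output and waits for $cmd
    my $status = $?;
    open STDOUT, '>&', $saved      or die "can't restore STDOUT: $!";
    close $saved;

    die $err unless $ok;
    return $status == 0;
}
```

With something like this, the inner loop over @urls would become the code reference passed as the second argument.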
and in test_url

    # elements of @{$s->{must_match}} = one of:
    # Note: all values may contain regexps
    # 1) rewrite URL: starts with '~' and is of the form
    #      '~s/string_one[a-z]+/string_two/'
    # 2) URI must not contain (starts with '!')
    #      '!SESSID'
    # 3) URI must contain
    #      'some_path'
    sub test_url {
        my ($url,$s) = @_;
        # exclude images
        return 0 if ($url =~ m|/[a-zA-Z0-9\.\_\-]+\.($non_text)[\?#;]*|io);
        # special URL conditions
        foreach (@{$s->{must_match}}) {
            if ($_ =~ /^~/) {               # re-write url
                # $' is the rule text after the leading '~', i.e. the s/// expression
                my $exp = '$url->opaque($u) if $u =~ ' . $';
                my $u = $url->opaque;
                eval $exp;
            } elsif ($_ =~ /^!(.*)/) {      # must not contain
                return 0 if $url =~ /$1/;
            } else {                        # must contain
                return 0 unless $url =~ /$_/;
            }
        }
        return 1;
    }

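
To make the three rule types concrete, here is a simplified, stand-alone sketch that applies the same kind of rules to a plain URL string. It is an illustration only: apply_rules is a hypothetical name, the example rules and URLs are made up, and unlike test_url it rewrites the whole string instead of the URI object's opaque part. As with test_url, the rewrite rule is run through a string eval, so rules should only come from a trusted config.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the (possibly rewritten) URL when every
# rule passes, or nothing when the URL should be skipped.
sub apply_rules {
    my ($url, @rules) = @_;
    for my $rule (@rules) {
        if ($rule =~ /^~(.*)/s) {        # rewrite: '~s/from/to/'
            my $subst = $1;
            eval "\$url =~ $subst; 1"
                or die "bad rewrite rule '$rule': $@";
        } elsif ($rule =~ /^!(.*)/s) {   # must not contain
            my $pat = $1;
            return if $url =~ /$pat/;
        } else {                         # must contain
            return unless $url =~ /$rule/;
        }
    }
    return $url;
}

# Made-up rules: strip a session id, reject SESSID, require "docs".
my @rules = ('~s/;jsessionid=\w+//', '!SESSID', 'docs');
my $kept  = apply_rules('http://example.com/docs/a.html;jsessionid=abc123', @rules);
print defined $kept ? "$kept\n" : "skipped\n";
```

The rules are applied in order, so a rewrite placed first changes what the later "must contain" and "must not contain" rules see.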
DIFF for spider.pl
# diff Spider.pm spider.pl
2d1
< package Swish::Spider;
66,83c65
< use vars qw(
< # global -- I suppose would be smarter to localize it per server.
< $abort
< %visited
< %validated
< %bad_links
< );
<
< sub init {
< $abort = 0;
< %visited = ();
< %validated = ();
< %bad_links = ();
< }
<
< sub run_spider {
<
< my @servers;
---
> use vars '@servers';
95a78
> my $abort;
97a81,85
> my %visited; # global -- I suppose would be smarter to localize it per server.
>
> my %validated;
> my %bad_links;
>
119c107
< }
---
>
Hope this is of some use to others.
Michael@Insulin-Pumpers.org
Received on Thu Aug 1 18:41:17 2002