At 09:41 AM 11/17/02 -0800, Leif Larsson wrote:
>SwishProgParameters default http://bim.ce.kth.se http://www.svktf.se
>etc... etc... (lots of sites)
>Bad directive on line #2 of file sites.txt: /www.euronom.se http://www.e
>urovac.nu etc... etc...
>
>Seems to me there is a 2000 character limit. As soon as the included
>"sites.txt" file grows over this limit, spider.pl barfs on me.
>
>Am i missing something ?
Nope, there's a line length limit in the config file.
Instead, use a config file for the spider. Since it's Perl, it gives you a
lot of options.
One way would be to do this in a spider config file (not tested, but
probably not too far off...)
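# Define list of servers to spider
my @server_list = qw(
    http://bim.ce.kth.se
    http://www.svktf.se
    ...
);

# Content types to accept. This list is an assumption (it was not
# defined in the sketch); adjust it to whatever you want indexed.
my @content_types = qw( text/html text/plain );

@servers = (
    {
        base_url  => \@server_list,
        email     => 'leif.larsson@l3system.se',
        delay_min => .0001,
        link_tags => [qw/ a frame /],

        # Skip image URLs before they are fetched
        test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },

        # Only accept responses with a listed content type
        test_response => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } @content_types;
            return 1 if $ok;
            print STDERR "$_[0] $content_type != (@content_types)\n";
            return;
        },
    },
);

1;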
In case you want different settings for each host, you might be better off
doing something like:
# Define list of servers to spider
my @server_list = qw(
    http://bim.ce.kth.se
    http://www.svktf.se
    ...
);

# Content types to accept. This list is an assumption (it was not
# defined in the sketch); adjust it to whatever you want indexed.
my @content_types = qw( text/html text/plain );

# Settings shared by every host
my %spider_config = (
    email     => 'leif.larsson@l3system.se',
    delay_min => .0001,
    link_tags => [qw/ a frame /],
    test_url  => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
    test_response => sub {
        my $content_type = $_[2]->content_type;
        my $ok = grep { $_ eq $content_type } @content_types;
        return 1 if $ok;
        print STDERR "$_[0] $content_type != (@content_types)\n";
        return;
    },
);

for ( @server_list ) {
    # Copy the shared settings, then set the per-host values
    my %this_host = %spider_config;
    $this_host{base_url} = $_;
    # maybe set "same_hosts" settings for each server
    push @servers, \%this_host;
}

1;
Then you can do things like get the server list from a file (seems silly to
have another file, though) or a database or whatever.
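For the file case, a rough sketch (not tested against spider.pl) might look like the following. The file name urls.txt and its format (one URL per line, with blank lines and lines starting with "#" skipped) are assumptions for illustration, not something spider.pl requires, and only a couple of the shared settings are repeated here:

```perl
# Hypothetical file: one URL per line; blank lines and "#" comments are skipped.
my $url_file = 'urls.txt';

open my $fh, '<', $url_file
    or die "Cannot open $url_file: $!";

my @server_list;
while ( my $line = <$fh> ) {
    $line =~ s/^\s+|\s+$//g;                 # trim whitespace, including the newline
    next if $line eq '' or $line =~ /^#/;    # skip blank lines and comments
    push @server_list, $line;
}
close $fh;

# Settings shared by every host (only a subset shown).
my %spider_config = (
    email     => 'leif.larsson@l3system.se',
    delay_min => .0001,
);

# One config hash per URL read from the file.
for my $url ( @server_list ) {
    my %this_host = ( %spider_config, base_url => $url );
    push @servers, \%this_host;
}

1;
```

Note that a relative file name is resolved against the directory the spider is run from, so an absolute path is probably safer. Since each line becomes its own Perl string, the line length limit in the swish config file no longer comes into play.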
--
Bill Moseley
mailto:moseley@hank.org
Received on Sun Nov 17 18:15:16 2002