On Wed, Jan 07, 2004 at 07:45:45AM -0800, Ander wrote:
> Hi all:
>
> I'm using spider.pl to index a list of servers, which I create
> dynamically (from a database). When we have 2500 documents indexed
> (more or less), spidering (and indexing, of course) stops.
I can't think of anything offhand that would make it stop there. You can
enable some of the debugging options to watch the progress, but if it's
quitting for normal reasons it won't report anything.
The patch below prints the number of links left in the queue each time a
link is taken off it, and changes the default 5-second delay to 0.
Run as (adjust for your shell):
$ SPIDER_DEBUG=url,links ./spider.pl default http://localhost >/dev/null 2>spider.out
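
Once that run finishes, the queue sizes printed by the patched spider can be
pulled out of spider.out. This is only a rough sketch: the function name
queue_sizes is made up for illustration, and it assumes the debug line looks
exactly like the "Links left in array = N" string that the print statement
in the patch produces.

```shell
# Hypothetical helper (not part of spider.pl): print the N from every
# "Links left in array = N" line in the given log file, one per line.
queue_sizes() {
    sed -n 's/^Links left in array = \([0-9][0-9]*\)$/\1/p' "$1"
}

# Show the last few queue sizes if the log from the run above exists.
if [ -f spider.out ]; then
    queue_sizes spider.out | tail -n 5
fi
```

If the last number is 0, the spider most likely just ran out of links. If it
stops with a large number still in the queue, something else ended the run.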
Is it possible that it's running out of links to follow?
Is it possible that the spider is eating memory and the process is being
killed by process limits?
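
To check the process-limit theory, something like the following might help.
It is a sketch, not anything spider.pl provides: show_limits and watch_mem
are names invented here, the ulimit options shown are the ones bash and
dash accept (other shells may differ), and the ps invocation assumes a ps
that supports -o and -p.

```shell
# Hypothetical helper: print the limits most likely to kill a long spider run.
show_limits() {
    echo "cpu time (s):        $(ulimit -t)"
    echo "data segment (kB):   $(ulimit -d)"
    echo "virtual memory (kB): $(ulimit -v)"
}

# Hypothetical helper: print resident and virtual size (kB) of process $1
# every $2 seconds (default 10) until that process exits.
watch_mem() {
    pid=$1
    interval=${2:-10}
    while kill -0 "$pid" 2>/dev/null; do
        ps -o rss= -o vsz= -p "$pid"
        sleep "$interval"
    done
}

show_limits
```

Run show_limits in the same shell that starts the spider, then in another
terminal run watch_mem with the spider's process id. Steadily growing
numbers that end near one of the limits would point to memory rather than
to the link queue.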
--- /usr/local/lib/swish-e/spider.pl 2003-12-13 14:13:03.000000000 -0800
+++ spider.pl 2004-01-07 08:24:07.000000000 -0800
@@ -247,9 +247,9 @@
 $server->{delay_sec} = int ($server->{delay_min} * 60);
 }
- $server->{delay_sec} = 5 unless defined $server->{delay_sec};
+ $server->{delay_sec} = 0 unless defined $server->{delay_sec};
 }
- $server->{delay_sec} = 5 unless $server->{delay_sec} =~ /^\d+$/;
+ $server->{delay_sec} = 0 unless $server->{delay_sec} =~ /^\d+$/;
 if ( $server->{ignore_robots_file} ) {
@@ -395,6 +395,7 @@
 die $server->{abort} if $abort || $server->{abort};
 my ( $uri, $parent, $depth ) = @{shift @link_array};
+print STDERR "Links left in array = " . scalar @link_array . "\n";
 delay_request( $server );
--
Bill Moseley
moseley@hank.org
Received on Wed Jan 7 16:36:29 2004