On Tue, Jul 05, 2005 at 09:13:08AM -0700, Aliasgar Dahodwala wrote:
> I am running swish-e 2.4.3 on a redhat linux box. I am using the
> included spider.pl script to spider my website.
>
> My problem: When i enable the keep_alive directive of the spider program
> and set the delay_sec to 5, the spider fetches the pages at blazing
> speed ignoring the delay_sec directive, and after going through around
> 5000 pages it then catches up on all the delay, it stops fetching any
> more pages and just keeps sleeping for 5 seconds each. After a long wait
> it continues from where it left off.
The long sleep afterwards sounds like a bug. Ignoring the delay_sec
setting on a keep alive connection is by design, though. The point of
keep alive is to allow faster requests -- avoiding the time required to
start up a new connection.
From the docs:

    # delay_sec
        This optional key sets the delay in seconds to wait between
        requests. See the LWP::RobotUA man page for more information. The
        default is 5 seconds. Set to zero for no delay.

        When using the keep_alive feature (recommended) the delay will be
        used only where the previous request returned a "Connection:
        closed" header.

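For reference, the two settings being discussed would sit together in a
spider config roughly like this. This is only a sketch: the base_url and
email values are placeholders, not taken from the original report, and
the rest of the config is left out.

```perl
# Sketch of a spider.pl config entry (e.g. in SwishSpiderConfig.pl).
# base_url and email are placeholder values for illustration only.
@servers = (
    {
        base_url   => 'http://www.example.com/',
        email      => 'webmaster@example.com',
        delay_sec  => 5,    # seconds to wait between requests
        keep_alive => 1,    # reuse the connection; delay is skipped while it stays open
    },
);

1;
```

With this combination the delay should only come into play when the
server closes the connection.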
So after fetching 5000 docs (is your MaxKeepAliveRequests set to
5000?) you are saying that the spider then sleeps for delay_sec
seconds x 5000 before it fetches any more documents?
Let's see, the wait time is set here:

    my $wait = $server->{delay_sec} - ( time - $server->{last_response_time} );
    return unless $wait > 0;
    sleep( $wait );

That last_response_time is the time the last request was completed,
which should normally be almost the same as the current time, so you
end up with delay_sec. So I don't see how it could be delaying more
than delay_sec.
Is that what you mean?
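
To make the arithmetic concrete, here is a small standalone sketch of
that wait calculation. The compute_wait helper and its arguments are
made up for illustration and are not part of spider.pl. It takes the
current time as a parameter so the result can be checked without
actually sleeping.

```perl
use strict;
use warnings;

# Hypothetical helper (not in spider.pl) mirroring the calculation above.
# Returns how many seconds the spider would sleep, or 0 for no sleep.
sub compute_wait {
    my ( $delay_sec, $last_response_time, $now ) = @_;
    my $wait = $delay_sec - ( $now - $last_response_time );
    return $wait > 0 ? $wait : 0;
}

my $now = time;

# Last response just completed: the full delay_sec applies.
printf "just finished:  %d\n", compute_wait( 5, $now, $now );         # 5

# Last response 2 seconds ago: only the remainder is slept.
printf "2 seconds ago:  %d\n", compute_wait( 5, $now - 2, $now );     # 3

# Last response 10 seconds ago: no sleep at all.
printf "10 seconds ago: %d\n", compute_wait( 5, $now - 10, $now );    # 0
```

In this sketch the sleep never exceeds delay_sec as long as
last_response_time is not later than the current time, which is why a
single wait longer than delay_sec would be unexpected.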
--
Bill Moseley
moseley@hank.org
Received on Tue Jul 5 11:18:07 2005