
Re: Spider.pl problem(s) on Linux (and other UNIXen?)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri May 23 2003 - 19:57:10 GMT
On Fri, May 23, 2003 at 10:50:00AM -0700, Greg Fenton wrote:
> SWISH-E 2.2.3 on RedHat 7.3
> 
> A co-worker of mine was trying to crawl a website very, very slowly (to
> limit impact on a production website).
> 
> He set in his spider.conf:
> 
>     base_url => "http://production_webserver",
>     delay_min => 0.5,
> 
> and has standard test_url and test_response subroutines.
> 
> The config works just fine on his W2K workstation, but we wanted to run
> it on a real system so we put it on one of our Linux servers.  The
> crawl failed with a Perl error:
> 
>   is_success not defined on line 477 of spider.pl
> 
> Digging through it, I came to the conclusion that this is a problem
> with the alarm() code [which does not fire on Win32 platforms].  The
> hard-coded defaults for two different params are 30 seconds (which is
> what our delay_min is set to).
> 
> By adding the following to our server configuration, we were able to
> successfully crawl:
> 
>     max_wait_time => 60,
> 
> There are two issues here then:
> 
> 1. Documentation for delay_min should reference max_wait_time

Yes, it should.


> 2. In the event that an alarm does go off, the code currently
>    crashes.  It would be nice if there was at least a message
>    indicating the source of the error and possibly a
>    suggestion of how to resolve it.  It would also be nice
>    to configure how to handle such alarms (ON_ALARM_EXIT,
>    ON_ALARM_RETRY, ON_ALARM_SKIP_URL), etc...

Thanks for finding this, Greg.  I need to look into this more, because I'm not sure
whether this is a problem with LWP::RobotUA or not.

Under a normal LWP request, if you die() inside the request callback, LWP will still
return a response object.  From the LWP::UserAgent docs:

       The request can be aborted by calling die() in the callback routine.
       The die message will be available as the "X-Died" special response
       header field.

And indeed it works:

   -- Starting to spider: http://bumby/apache/swish.cgi --
>> -Failed 0 Cnt: 1 http://bumby/apache/swish.cgi 500 timed out
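
For reference, here's roughly what I tested -- a stripped-down sketch, not spider.pl
itself, against the same local URL as above:

    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new( GET => 'http://bumby/apache/swish.cgi' );

    # die() inside the content callback aborts the transfer, but
    # request() still hands back a response object rather than letting
    # the die kill the program.  Depending on where the die happens the
    # message ends up in the X-Died header or in a faked 500 response.
    my $res = $ua->request( $req, sub { die "timed out\n" }, 4096 );

    print $res->status_line, "\n";
    print 'X-Died: ', $res->header('X-Died') || '(none)', "\n";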

What's happening here is that the alarm is firing while LWP::RobotUA is sleeping out its
delay, and in that case no response object comes back at all.

One fix might be to make sure that max_wait_time is always greater than delay_min
(remembering that delay_min is in minutes and max_wait_time is in seconds).  Another fix
might be to not set an alarm for the max_wait_time and instead use the $ua->timeout
option.  The problem I've had with $ua->timeout is that on some platforms things like DNS
lookups can block forever and the timeout never triggers.  I do use $ua->timeout on
Windows, because there's no alarm() on Windows.
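
So for now the safe configuration looks something like this (the test_url/test_response
stubs are just placeholders for your own subroutines):

    # spider.conf -- keep max_wait_time (seconds) larger than the
    # delay_min (minutes) sleep; 0.5 minutes == 30 seconds here.
    my %serverA = (
        base_url      => 'http://production_webserver',
        delay_min     => 0.5,    # 30 seconds between requests
        max_wait_time => 60,     # alarm; must exceed delay_min * 60
        test_url      => sub { 1 },    # placeholder
        test_response => sub { 1 },    # placeholder
    );

    @servers = ( \%serverA );

    1;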

Also, good question on what to do about alarm timeouts.  I've always treated them as just
errors returned from the server.  Of course, that's not always true.

I'll update docs and add code to catch this case.
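
Something along these lines, maybe -- just a sketch, where on_alarm is Greg's proposed
knob (nothing like it exists yet) and $server, $request, and $uri stand in for the
spider's own variables:

    my $response = eval {
        local $SIG{ALRM} = sub { die "max_wait_time exceeded\n" };
        alarm $server->{max_wait_time};
        my $r = $ua->simple_request($request);
        alarm 0;
        $r;
    };
    alarm 0;    # belt and braces: never leave a pending alarm

    unless ( defined $response ) {
        my $action = $server->{on_alarm} || 'skip_url';   # hypothetical option
        if ( $action eq 'exit' ) {
            die "Aborting spider, timeout on $uri: $@";
        }
        elsif ( $action eq 'retry' ) {
            warn "Timeout on $uri -- requeueing: $@";
            # push the URL back on the queue here
        }
        else {    # skip_url: treat it like any other failed request
            warn "Timeout on $uri -- skipping: $@";
        }
    }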

BTW -- the problem you found may not be an issue anymore, because in the current
spider.pl code I no longer use $ua->delay() in the robot and instead handle the delay
inside spider.pl.  The reason for that change is that with $ua->delay() the spider only
delays after Connection: close responses, but not after Connection: keep-alive responses.
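
The delay handling now looks something like this -- a rough sketch of the idea, not the
exact spider.pl code:

    use Time::HiRes qw( sleep time );   # for sub-minute delays

    my $last_request = 0;

    # Called before every request, so the delay applies whether the
    # previous response was keep-alive or close.
    sub delay_request {
        my ($server) = @_;
        my $delay   = ( $server->{delay_min} || 0 ) * 60;   # minutes -> seconds
        my $elapsed = time() - $last_request;
        sleep( $delay - $elapsed ) if $elapsed < $delay;
        $last_request = time();
    }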


-- 
Bill Moseley
moseley@hank.org