SWISH-E 2.2.3 on RedHat 7.3
A co-worker of mine was trying to crawl a website very, very slowly (to
limit impact on a production website).
He set the following in his spider.conf:
    base_url  => "http://production_webserver",
    delay_min => 0.5,
and used the standard test_url and test_response subroutines.
The config works just fine on his W2K workstation, but we wanted to run
it on a real system so we put it on one of our Linux servers. The
crawl failed with a Perl error:
    is_success not defined on line 477 of spider.pl
Digging through it, I came to the conclusion that this is a problem
with the alarm() code [which does not fire on Win32 platforms]. The
hard-coded default for two different params is 30 seconds, which is
also what our delay_min works out to (0.5 minutes).
By adding the following to our server configuration, we were able to
successfully crawl:
    max_wait_time => 60,
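
For reference, here is roughly how the combined settings look in one place. This is only a sketch: the @servers layout is an assumption about how spider.conf is laid out, and the test_url and test_response callbacks are left out. The point is that delay_min is given in minutes while max_wait_time is given in seconds, so the two are easy to set to the same effective value by accident.

```perl
# spider.conf (sketch; the layout is assumed, the values are from this report)
@servers = (
    {
        base_url      => "http://production_webserver",
        delay_min     => 0.5,    # minutes between requests (30 seconds)
        max_wait_time => 60,     # seconds; raised above the 30-second default
    },
);

1;
```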
There are two issues here then:
1. The documentation for delay_min should reference max_wait_time.
2. In the event that an alarm does go off, the code currently
   crashes. It would be nice if there were at least a message
   indicating the source of the error and possibly a
   suggestion of how to resolve it. It would also be nice
   to be able to configure how such alarms are handled
   (ON_ALARM_EXIT, ON_ALARM_RETRY, ON_ALARM_SKIP_URL, etc.).
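
To make the second point concrete, here is a minimal sketch of what I have in mind. It is not taken from spider.pl: fetch_with_timeout, the $on_alarm policy strings ('exit', 'retry', 'skip') and the retry count are all hypothetical names used for illustration. It uses the usual eval/alarm idiom so that a timeout produces a message naming max_wait_time instead of an unrelated error later on.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): call $fetch with a time
# limit of $max_wait seconds and apply the $on_alarm policy
# ('exit', 'retry' or 'skip') if the limit is hit.
sub fetch_with_timeout {
    my ( $fetch, $max_wait, $on_alarm, $retries ) = @_;
    $retries = 1 unless defined $retries;

    for my $attempt ( 0 .. $retries ) {
        my $response = eval {
            local $SIG{ALRM} = sub { die "alarm\n" };
            alarm $max_wait;
            my $r = $fetch->();
            alarm 0;
            $r;
        };
        my $err = $@;
        alarm 0;    # never leave an alarm pending
        return $response unless $err;
        die $err unless $err eq "alarm\n";    # not a timeout: re-throw

        my $msg = "No response within $max_wait seconds; "
                . "consider raising max_wait_time";
        die "$msg\n" if $on_alarm eq 'exit';
        warn "$msg\n";
        return undef if $on_alarm eq 'skip';
        # 'retry': fall through to the next loop iteration
    }
    return undef;    # retries exhausted, treat the URL as skipped
}
```

A caller would pass the actual request as a code reference, for example fetch_with_timeout( sub { ... }, 60, 'skip' ), and treat an undefined return value as "skip this URL". Whether retries should wait delay_min before trying again is left open here.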
I could look into adding such an enhancement if desired, though as
always, time may be an issue. I don't want to go off working on this
code if someone else "owns" development of the spider (or if the 2.2.3
spider.pl code is going away in a future release).
=====
Greg Fenton
greg_fenton@yahoo.com
Received on Fri May 23 17:50:42 2003