Spider taking too long to index?

From: David VanHook <dvanhook(at)not-real.mshanken.com>
Date: Tue Oct 08 2002 - 13:50:29 GMT
Hello -- at the good suggestions of Bill and others, I've decided to make a
go of it with spider.pl.  It works -- but it seems to be taking
significantly longer than the times other people on this list have reported.

Last night, while the site was not busy at all, spider.pl took 3 hours and
16 minutes to index 19,277 files (a rate of 1.6 per second, according to the
SWISH report).  Total CPU time was 22 minutes, 20 seconds.
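
As a rough check of the figures above, the following Perl sketch recomputes
the throughput and the share of elapsed time spent on the CPU.  The variable
names are illustrative only and are not part of spider.pl or SWISH.

```perl
use strict;
use warnings;

# Figures taken from the run described above.
my $files        = 19_277;
my $wall_seconds = 3 * 3600 + 16 * 60;   # 3 h 16 min of elapsed time
my $cpu_seconds  = 22 * 60 + 20;         # 22 min 20 s of CPU time

# Derived rates.
my $files_per_second = $files / $wall_seconds;
my $seconds_per_file = $wall_seconds / $files;
my $cpu_share        = $cpu_seconds / $wall_seconds;

printf "files per second:          %.2f\n", $files_per_second;
printf "seconds per file:          %.2f\n", $seconds_per_file;
printf "CPU share of elapsed time: %.1f%%\n", 100 * $cpu_share;
```

The CPU share works out to a little over a tenth of the elapsed time, so
most of each run appears to be spent waiting rather than computing.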

My approach is to feed spider.pl a single page containing a list of links to
all the pages I want indexed.  That page is huge, of course.  Then I tell
spider.pl to go only one level deep.  So it grabs the first item on the
page, indexes it, returns to the list, grabs the next item, indexes it,
returns to the list, and so on.  Is that not the way this should work?
Should I modify some setting in SwishSpiderConfig.pl to account for this
setup?

Here are the important parts of SwishSpiderConfig.pl:

        # limit to only .html files
        # test_url    => sub { $_[0]->path =~ /\.html?$/ },

        delay_min   => .0001,     # Delay in minutes between requests
        max_depth   => 1,
        max_time    => 300,        # Max time to spider in minutes
        max_files   => 30000,       # Max Unique URLs to spider
        max_indexed => 30000,        # Max number of files to send to swish for indexing
        keep_alive  => 1,         # enable keep alives requests
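
To put the settings above in perspective, this small Perl sketch converts
delay_min to seconds and compares the configured limits with the run reported
earlier (19,277 files in 3 hours 16 minutes).  It assumes the delay is applied
once per request, which is an assumption about spider.pl's behaviour rather
than something confirmed here; %config and the other variable names are
illustrative only.

```perl
use strict;
use warnings;

# Values copied from the SwishSpiderConfig.pl fragment above.
my %config = (
    delay_min   => .0001,
    max_time    => 300,
    max_files   => 30_000,
    max_indexed => 30_000,
);

# Figures from the reported run.
my $files_fetched = 19_277;
my $run_minutes   = 3 * 60 + 16;

# delay_min is expressed in minutes, so convert it to seconds.
my $delay_seconds = $config{delay_min} * 60;

# Assumption: one delay per request, so the total scales with the file count.
my $total_delay_seconds = $delay_seconds * $files_fetched;

printf "delay per request:      %.3f s\n", $delay_seconds;
printf "total configured delay: %.0f s\n", $total_delay_seconds;
printf "reached max_time:       %s\n", $run_minutes   >= $config{max_time}    ? "yes" : "no";
printf "reached max_files:      %s\n", $files_fetched >= $config{max_files}   ? "yes" : "no";
printf "reached max_indexed:    %s\n", $files_fetched >= $config{max_indexed} ? "yes" : "no";
```

Under that assumption the configured delay comes to roughly two minutes
across the whole run, and none of the three limits was reached.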

Because I'm generating this list of items to index myself, I turned off the
test_url function.  But that didn't seem to help performance all that much.

I'm running this on a Netra T-1 with 256 MB of RAM and one 300 MHz SPARC
processor.  So it's a decent machine, but nothing huge.  Is that the
problem?  Any other suggestions?  Our site is quite fast, so I don't think
overall site performance is the cause.

Thanks --

Dave V.


===========================
David VanHook
Director of Technology
Wine Spectator Online
http://www.winespectator.com
dvanhook@mshanken.com
Received on Tue Oct 8 13:54:19 2002