Re: Spider taking too long to index?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Oct 08 2002 - 14:20:53 GMT
On Tue, 8 Oct 2002, David VanHook wrote:

> Last night, during a time when the site was not very busy at all, it took
> spider.pl 3 hours and 16 minutes to index 19,277 files (a rate of 1.6 per
> second, according to the SWISH report).  The total amount of CPU utilization
> time was 22 minutes, 20 seconds.
> 
> The way I'm doing it is, I feed Spider.pl a single page which contains a
> list of links to all the pages I want it to index.  That page is huge, of
> course.  Then I tell spider.pl to only go one level deep.  So it grabs the
> first item on the page, indexes it, returns to the list, grabs the next
> item,  indexes it, returns to the list, etc.  Is that not the way this
> should work?  Should I modify some setting on SwishSpiderConfig.pl to
> account for this system?

Yes, that's the way it should work, but it is only fetching one document at
a time, which can be slow.  It took you 11,760 seconds to fetch 19,277
files.
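(3 hours 16 minutes is 11,760 seconds, and 19,277 / 11,760 works out to
about 1.6 documents per second, or roughly 0.6 seconds per document, which
matches the SWISH report.)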

That can be faster if keep alives are working.  It requires both that the
web server is configured to do keep alives, of course, and that the
spidering machine has a current version of LWP installed.  I think
spider.pl will complain if you set keep_alive and do not have the
supporting LWP code, though.
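
A quick way to see which LWP is installed on the spidering machine:

 > perl -MLWP -le 'print $LWP::VERSION'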

You should be able to check if keep alive is working on the server:

 > cat t.pl
@servers = (
    {
        base_url    => 'http://apache.org',
        email       => 'swish@domain.invalid',
        max_files   => 1,
        keep_alive  => 1,         # enable keep-alive requests
    },
);


> SPIDER_DEBUG=headers ./spider.pl t.pl >/dev/null

----HEADERS for http://apache.org ---
Cache-Control: max-age=86400
Connection: Keep-Alive                 <<<<<
Date: Tue, 08 Oct 2002 14:11:58 GMT
Accept-Ranges: bytes
Server: Apache/2.0.43 (Unix)
Content-Length: 7751
Content-Type: text/html
Content-Type: text/html; charset=iso-8859-1
Expires: Wed, 09 Oct 2002 14:11:58 GMT
Client-Date: Tue, 08 Oct 2002 14:12:01 GMT
Client-Response-Num: 1
Keep-Alive: timeout=5, max=100         <<<<
Title: Welcome! - The Apache Software Foundation
X-Meta-Author: ASF
X-Meta-Email: apache@apache.org


If it says:

  Connection: close

then keep alives are not working.
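
If that's what you see and the server is Apache, keep alives are normally
enabled with something like this in httpd.conf (other servers will differ):

  KeepAlive On
  MaxKeepAliveRequests 100
  KeepAliveTimeout 5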

If you are spidering more than one site, then set the keep_alive value to a
larger number -- that setting is the number of connection cache entries LWP
maintains.
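
For example, with two sites something like this (hypothetical hostnames)
gives LWP room to keep a cached connection to each one:

@servers = (
    {
        base_url    => 'http://site-one.example.com',
        email       => 'swish@domain.invalid',
        keep_alive  => 2,     # one cache entry per site
    },
    {
        base_url    => 'http://site-two.example.com',
        email       => 'swish@domain.invalid',
        keep_alive  => 2,
    },
);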

If your web server has any monitoring features, they may show whether keep
alive requests are actually being used.  In Apache you can use mod_status
to monitor the server.
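
A minimal mod_status setup in httpd.conf looks roughly like this (restrict
access to it as you see fit); the server-status page then shows a "K" for
connections sitting in a keep-alive wait:

  ExtendedStatus On
  <Location /server-status>
      SetHandler server-status
  </Location>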

If keep alives still don't make it fast enough and you don't mind hitting
the web server harder, there are ways to do parallel fetching, but that
would require a rewrite of the spider.pl program.
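
Just as a rough sketch of what parallel fetching looks like -- this is not
spider.pl, it assumes the separate LWP::Parallel module is installed, and
it only fetches a list of URLs read from stdin, with none of spider.pl's
parsing or indexing:

#!/usr/bin/perl -w
use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# One URL per line on stdin.
my @urls = grep { /\S/ } map { chomp; $_ } <STDIN>;

my $pua = LWP::Parallel::UserAgent->new;
$pua->max_req( 5 );     # up to 5 requests in flight per host
$pua->redirect( 1 );    # follow redirects

$pua->register( HTTP::Request->new( GET => $_ ) ) for @urls;

# Block until every request has finished, then report what came back.
my $entries = $pua->wait;
for my $entry ( values %$entries ) {
    my $res = $entry->response;
    printf "%s %s (%d bytes)\n",
        $res->code, $res->request->uri, length $res->content;
}

The max_req setting is what controls how hard you hit the server, so keep
it small unless you know the server can take it.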

> Because I'm generating this list of items to index myself, I turned off the
> test_url function.  But that didn't seem to help performance all that much.

No, it wouldn't -- all the time is probably spent either in the connection
process or in the transfer of data.  You can see from the huge difference
between CPU time and running time that the program is mostly waiting for
I/O.
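Out of 11,760 seconds of running time only about 1,340 seconds (the 22
minutes, 20 seconds of CPU) were spent computing, so the spider was sitting
idle, waiting on the network, close to 90% of the time.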




-- 
Bill Moseley moseley@hank.org
Received on Tue Oct 8 14:24:37 2002