
Re: request delay problem with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Jul 05 2005 - 20:59:00 GMT
On Tue, Jul 05, 2005 at 03:19:01PM -0400, Aliasgar Dahodwala wrote:
> What I failed to find out is why the spider sleeps something
> around 5000 x delay_sec after fetching somewhere around 5824 files.
> (The exact count value is 5824.)  In the debug file I have that many
> "sleeping 5 seconds" messages before the spider starts fetching again.
> 
> So I am thinking there is a bug in there somewhere.

Sounds like it.  What's magic about 5824, I wonder.  In my version of
spider.pl, delay_request() is called inside the spider() function.
That's not the best place to call delay_request(), because the request
isn't actually being made at that point (test_url could still skip the
request, for example).  But that's why the wait time is calculated
from the last time a request really was completed.

Having a bunch of "sleeping 5 seconds" in there without any other
requests happening doesn't make sense.
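
The point of calculating it that way is that each call should only
sleep out whatever part of delay_sec hasn't already elapsed since the
last completed request.  Something along these lines (a rough sketch
only, not the actual spider.pl code; last_response_time is just an
illustrative name):

    sub delay_request {
        my ( $server ) = @_;

        # First request: nothing to wait for yet.
        my $last = $server->{last_response_time} || return;

        # Only sleep out the part of delay_sec that hasn't already
        # elapsed since the last request really completed.
        my $wait = $server->{delay_sec} - ( time() - $last );
        return if $wait <= 0;

        print STDERR "sleeping $wait seconds\n";
        sleep $wait;
    }

    # ...and wherever a request actually completes:
    $server->{last_response_time} = time();

With that scheme a long run of back-to-back sleeps should only be
possible if no requests are completing in between, which is why your
output looks wrong.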

Can you generate a simple test case?  This is what I did:

test.cgi:

    #!/usr/bin/speedy
    use strict;
    use warnings;


    # Take the count from the query string and bump it, or start at 1.
    my $count = ( $ENV{QUERY_STRING} || '' ) =~ /count=(\d+)/ ? $1 + 1 : 1;

    # After 6,000 documents, return a 404 so the spider stops.
    if ( $count > 6000 ) {
        print <<EOF;
    content-type: text/html
    status: 404 Not Found

    <html><body>Not found</body></html>
    EOF

        exit;
    }

    print <<EOF;
    Content-Type: text/html

    <html>
    <head><title>This is doc $count</title></head>
    <body>
    <a href="test.cgi?count=$count">Rec$count</a>
    </body>
    </html>
    EOF


httpd.conf

    Include /etc/apache/modules.conf
    ErrorLog error_log
    PidFile  pid_file
    ServerName localhost

    TypesConfig /dev/null
    Listen 4321


    DocumentRoot /home/moseley/apache

    <files test.cgi>
        Options +ExecCGI
        SetHandler cgi-script
    </files>



spider.conf:


    moseley@bumby:~/apache$ cat spider.conf 
    @servers = (
        {
            base_url => 'http://localhost:4321/test.cgi',
            delay_sec => 5,
            keep_alive => 1,
            email => 'moseley@localhost',
        }
    );


Start apache:

    moseley@bumby:~/apache$ /usr/sbin/apache -d `pwd` -f httpd.conf


Run the spider (modified to print the "sleeping" message even without debug enabled):

    moseley@bumby:~/apache$ ./spider.pl spider.conf >/dev/null
    ./spider.pl: Reading parameters from 'spider.conf'
    sleeping 5 seconds
    sleeping 5 seconds
    [...]
    Summary for: http://localhost:4321/test.cgi
         Connection: Close:      60  (0.1/sec)
    Connection: Keep-Alive:   5,941  (14.5/sec)
               Total Bytes: 698,679  (1704.1/sec)
                Total Docs:   6,000  (14.6/sec)
               Unique URLs:   6,001  (14.6/sec)

So it fetched 6,000 docs, and the sleeping messages showed up as expected.

Is there a way you can demonstrate what you are seeing so I can repeat
it?


-- 
Bill Moseley
moseley@hank.org
