Re: Spidering phpBB

From: Bill Moseley <moseley(at)>
Date: Tue Aug 31 2004 - 23:55:06 GMT
On Tue, Aug 31, 2004 at 01:05:01PM -0700, Shaffer, Chris wrote:
> As far as my problem crawling the forums...  I think I know what is
> going on...  The session_id is changing occasionally, causing it to go
> in circles...  Is there any way I can filter out something matching
> 'sid=....' from the end of the path before the spider decides whether
> or not it's crawled it yet?

Yes, the "test_url()" call-back function is called right before
checking if the URL has already been seen.  The test_url() function is
passed the URI object (perldoc URI) and that can be modified.

Untested, but maybe something like this in your spider config:

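    test_url => sub {
        my ( $uri ) = @_;                # URI object for the link being tested
        my %params = $uri->query_form;   # a hash keeps one value per name
        delete $params{sid};
        $uri->query_form( \%params );    # hash ref: an empty hash clears the query
        return 1;                        # true: keep processing this URL
    },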

The problem with that method (using a hash) is that it keeps only one
value per parameter name, so be careful.  If a parameter can appear
more than once, look at the query_param methods from URI::QueryParam
instead, or keep the key/value pairs in an array.
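
A rough sketch of the array approach, also untested against a real
spider run.  The helper name strip_sid is made up for illustration; it
relies only on URI's query_form, which returns a flat key/value list and
accepts an array reference when setting.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of spider.pl): drop every "sid"
# parameter while keeping repeated parameters and their order.
sub strip_sid {
    my ( $uri ) = @_;
    my @pairs = $uri->query_form;    # flat list: key, value, key, value, ...
    my @keep;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        push @keep, $key, $value unless $key eq 'sid';
    }
    $uri->query_form( \@keep );      # an empty list removes the query string
    return $uri;
}
```

Inside test_url you would call strip_sid( $uri ) and then return 1, so
the seen-check works on the URL without its session id.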

There's likely a better tool for dealing with query strings.

Bill Moseley

Received on Tue Aug 31 16:56:20 2004