On Tue, May 25, 2004 at 12:06:26PM -0700, Justin Tang wrote:
> Hi:
> I was wondering if there is any way to change the URL that is about to be
> queued using a call back function in test_url. Specifically, say if I have
>
> www.mysite.com/page.html?query=value
>
> to be placed in the queue, and I want to change it to
>
> www.mysite.com/page.html
>
> how can I change the URL that is being passed back? Thanks!
I think so. Try something like:
    sub remove_query {
        my ( $uri ) = @_;
        $uri->query( undef )
            if $uri->path eq '/page.html';
        return 1;
    }
Then in your spider config:

    test_url => \&remove_query,

I think you can also specify more than one function, if you need to, like
this:

    test_url => [ \&remove_query, \&other_subroutine ],

$uri is a URI object. Run "perldoc URI" to see how you can mess with it.
Note that after test_url is checked, spider.pl then checks if
$uri->canonical has been visited before. So if you do the above, /page.html
will only be visited once, no matter how many different query strings link
to it.
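
To see the effect outside the spider, here is a rough standalone sketch. It
reuses the callback above, but the plan() helper and its %seen hash are made
up for illustration: they only stand in for spider.pl's own visited check and
are not its actual code. The second and third URLs are invented examples too.

```perl
use strict;
use warnings;
use URI;

# Same callback as in the reply: drop the query string for /page.html only.
sub remove_query {
    my ( $uri ) = @_;
    $uri->query( undef )
        if $uri->path eq '/page.html';
    return 1;
}

# Hypothetical helper (not part of spider.pl): run the callback on each
# link, then use the canonical form as the "already visited?" key.
sub plan {
    my @links = @_;
    my %seen;
    my @result;
    for my $link ( @links ) {
        my $uri = URI->new( $link );
        remove_query( $uri );
        my $key = $uri->canonical->as_string;
        push @result, ( $seen{$key}++ ? 'skip' : 'fetch' ) . " $key";
    }
    return @result;
}

print "$_\n" for plan(
    'http://www.mysite.com/page.html?query=value',
    'http://www.mysite.com/page.html?query=other',
    'http://www.mysite.com/other.html?query=value',
);
```

The two /page.html links end up with the same canonical URL, so the second
one is reported as "skip". The /other.html link keeps its query string
because the callback only touches /page.html.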
>
> -Justin
>
>
--
Bill Moseley
moseley@hank.org
Received on Tue May 25 12:18:47 2004