Bill Moseley wrote:
> If I remember correctly, the %visited hash gets set when extracting
> links, so it's not easy to do what you are trying.
OK, I whipped this up. The %bogus_visited hash is populated in
the test_url subroutine. The spider is running now. Do you think
it will work?
# Append a debug message to __dbg.log
sub dbg {
    open(my $fh, '>>', '__dbg.log')
        or do { warn "Cannot open __dbg.log: $!"; return };
    print {$fh} $_[0];
    close($fh);
}
# Normalized URLs seen so far. Declared outside the sub so that
# it persists between calls.
my %bogus_visited;

sub my_test_url {
    my $uri  = shift;
    my $path = $uri->path;
    my $url  = $uri->canonical->as_string;

    # skip images
    return 0 if $path =~ m{\.(gif|png|jpeg|jpg)$}i;
    # skip archives
    return 0 if $path =~ m{\.(zip|gz|tgz|tar)$}i;

    # hash for bogus urls
    # change http://my.host/(A(AcWS....4PGw2))/default.aspx
    # to http://my.host/(__bogus__)/default.aspx
    if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
        if ($bogus_visited{$url}) {
            dbg("BOGUS (duplicate): $url\n");
            return 0;
        }
        dbg("BOGUS (new): $url\n");
        $bogus_visited{$url} = 1;
    }
    return 1;
}
> If your server
> continues to give new URLs to follow then the spider will follow
> those. (You might try changing $uri->path in a test_response
> callback, but I don't think that will work).
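For anyone who wants to check the normalization idea outside the spider, here is a minimal standalone sketch. It assumes the session token always has the /(A(...)) shape shown above. The helper names normalize_session_url and is_new_url and the sample URLs are made up for illustration; they are not part of spider.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): collapse the per-visit
# session token so that equivalent URLs compare equal.
sub normalize_session_url {
    my ($url) = @_;
    $url =~ s{/\(A\(.*?\)\)}{/(__bogus__)};
    return $url;
}

# Normalized URLs seen so far; file-scoped so it persists between calls.
my %seen;

# Hypothetical helper: returns 1 the first time a normalized URL is
# seen and 0 on every later occurrence.
sub is_new_url {
    my ($url) = @_;
    return $seen{ normalize_session_url($url) }++ ? 0 : 1;
}

# Made-up sample URLs, for illustration only.
my @urls = (
    'http://my.host/(A(token-one))/default.aspx',
    'http://my.host/(A(token-two))/default.aspx',
    'http://my.host/(A(token-one))/other.aspx',
);

for my $url (@urls) {
    printf "%-4s %s\n", (is_new_url($url) ? 'new' : 'dup'), $url;
}
```

Only the token is collapsed, so two URLs that differ in anything else (path, query string) still count as distinct pages.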
Received on Thu May 19 08:57:45 2005