Re: URL-fixing with callback routines for spider.pl

From: koszalekopalek <koszalekopalek(at)not-real.interia.pl>
Date: Thu May 19 2005 - 16:07:56 GMT
koszalekopalek wrote:
> Bill Moseley wrote:
> 
>>If I remember correctly, the %visited hash gets set when extracting
>>links, so it's not easy to do what you are trying.  
> 
> 
> Ok, I whipped this up. The %bogus_visited hash is populated in
> the test_url subroutine. The spider is running now. Do you think
> it will work?


Looks like it worked :-) At least the spider is
not spidering any more.

Btw, any pointers on why the server is not happy
with use_cookies => 1?

A.



> sub dbg {
> 	# append a message to the debug log; give up quietly if it cannot be opened
> 	open (my $fh, '>>', '__dbg.log') or return;
> 	print {$fh} $_[0];
> 	close ($fh);
> };
> 
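> # normalized URLs already seen; declared outside the sub so it persists between calls
> my %bogus_visited;
> 
> sub my_test_url {
> 	my $uri = shift;
> 	my $path = $uri->path;
> 	# work on a plain string copy so the substitution below does not touch $uri
> 	my $url = $uri->canonical->as_string;
> 	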
> 	# skip images
> 	return 0 if $uri->path =~ m{\.(gif|png|jpeg|jpg)$}i;
> 	# skip archives
> 	return 0 if $uri->path =~ m{\.(zip|gz|tgz|tar)$}i;
> 	
> 	# hash for bogus urls
> 	# change   http://my.host/(A(AcWS....4PGw2))/default.aspx
> 	# to       http://my.host/(__bogus__)/default.aspx
> 	if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
> 		if ($bogus_visited{$url}) {
> 			dbg ("BOGUS (duplicate): $url\n");
> 			return 0;
> 		} else {
> 			dbg ("BOGUS (new): $url\n");
> 			$bogus_visited{$url} = 1;
> 		};
> 	};
> 	return 1;
> };
> 
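
For reference, the deduplication idea in the quoted callback can be tried on its own, without the spider. The following is only a minimal sketch using plain strings instead of URI objects; the helper names normalize_session_url and is_new_url, the %seen_normalized hash, and the token1/token2 values are made up for illustration and are not part of spider.pl.

```perl
use strict;
use warnings;

# Hypothetical names for illustration; not part of spider.pl.
# Normalized URLs seen so far. File-scoped so it persists across calls.
my %seen_normalized;

# Replace the variable /(A(...)) path segment with a fixed placeholder
# so that URLs differing only in that token compare equal.
sub normalize_session_url {
    my ($url) = @_;
    (my $normalized = $url) =~ s{/\(A\(.*?\)\)}{/(__bogus__)};
    return $normalized;
}

# Return 1 the first time a normalized URL is seen, 0 on every repeat.
sub is_new_url {
    my ($url) = @_;
    my $key = normalize_session_url($url);
    return 0 if $seen_normalized{$key}++;
    return 1;
}

# Two URLs that differ only in the token, then an unrelated URL.
for my $url (
    'http://my.host/(A(token1))/default.aspx',
    'http://my.host/(A(token2))/default.aspx',
    'http://my.host/other.aspx',
) {
    printf "%s => %s\n", $url, is_new_url($url) ? 'new' : 'duplicate';
}
```

Unlike the callback above, this sketch also records URLs that have no token segment, so a plain URL seen twice would be reported as a duplicate as well. Whether that is wanted inside a test_url callback depends on how the spider itself tracks visited links.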

Received on Thu May 19 09:07:57 2005