Re: URL-fixing with callback routines for spider.pl

From: koszalekopalek <koszalekopalek(at)not-real.interia.pl>
Date: Thu May 19 2005 - 15:57:39 GMT
Bill Moseley wrote:
> If I remember correctly, the %visited hash gets set when extracting
> links, so it's not easy to do what you are trying.  

OK, I whipped this up. The %bogus_visited hash is populated in the
test_url callback (my_test_url below). The spider is running now.
Do you think it will work?


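sub dbg {
	# append the message to the debug log; warn and carry on if it can't be opened
	open (my $fh, '>>', '__dbg.log') or do { warn "__dbg.log: $!"; return };
	print {$fh} $_[0];
	close ($fh);
}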

# normalized urls seen so far; declared outside the sub so that the
# contents persist between calls
my %bogus_visited;

sub my_test_url {
	my $uri = shift;
	my $path = $uri->path;
	# work on a plain string copy, not on the URI object itself
	my $url = $uri->canonical->as_string;
	
	# skip images
	return 0 if $path =~ m{\.(gif|png|jpeg|jpg)$}i;
	# skip archives
	return 0 if $path =~ m{\.(zip|gz|tgz|tar)$}i;
	
	# hash for bogus urls
	# change   http://my.host/(A(AcWS....4PGw2))/default.aspx
	# to       http://my.host/(__bogus__)/default.aspx
	# (the replacement keeps the leading slash that the pattern consumes)
	if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
		if ($bogus_visited{$url}) {
			dbg ("BOGUS (duplicate): $url\n");
			return 0;
		} else {
			dbg ("BOGUS (new): $url\n");
			$bogus_visited{$url} = 1;
		}
	}
	return 1;
}
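
For checking the substitution without waiting for the spider, here is a
minimal standalone sketch. It only exercises the regex and the
duplicate bookkeeping on plain strings. The helper name
normalize_session_url and the two session tokens are made up for the
test, and the replacement keeps the leading slash so that the result
matches the comment in my_test_url.

```perl
use strict;
use warnings;

# Hypothetical helper: the same kind of substitution as in my_test_url,
# applied to a plain string instead of a URI object.
sub normalize_session_url {
	my $url = shift;
	$url =~ s{/\(A\(.*?\)\)}{/(__bogus__)};
	return $url;
}

my %seen;       # normalized urls seen so far
my @results;    # one "status: url" line per input

# The session tokens below are invented examples.
for my $url (
	'http://my.host/(A(AcWS4PGw2))/default.aspx',
	'http://my.host/(A(Zx91QqLm0))/default.aspx',
	'http://my.host/plain.aspx',
) {
	my $key = normalize_session_url($url);
	my $status = $seen{$key}++ ? 'duplicate' : 'new';
	push @results, "$status: $key";
	print "$status: $key\n";
}
```

The second URL differs from the first only in the session token, so it
should be reported as a duplicate, while the URL without a token passes
through unchanged.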




> If your server
> continues to give new URLs to follow then the spider will follow
> those.  (You might try changing $uri->path in a test_response
> callback, but I don't think that will work).

Received on Thu May 19 08:57:45 2005