Re: URL-fixing with callback routines for spider.pl

From: koszalekopalek <koszalekopalek(at)not-real.interia.pl>
Date: Thu May 19 2005 - 14:49:54 GMT
Bill Moseley wrote:

 > You should be able to set
 >
 >     use_cookies => 1,
 >
 > in the spider config to enable its cookie jar.

Actually, I tried that, but it did not work: the spider
was still redirected to the "funny" URLs (e.g.
http://my.host/(A(A_random_string_inserted))/some/path ).


 >>I tried two callback functions for spider.pl (test_url and
 >>filter_content) but both did not work:
 >
 > test_url wouldn't work because that's before the request to the
 > server is made -- it would change what is requested from the server.

Ok, this is clear. Thanks.


 >>sub my_filter_content {
 >>	my ( $uri, $response, $server, $content_ref ) = @_;
 >>	my $path = $uri->path;
 >>	# remove random string from $path
 >>	$path =~ s{/\(A\(.*?\)\)}{};
 >>	$uri->path($path);
 >>	return 1;
 >>}
 >>
 >>URLs are correctly re-written but the spider never stops spidering.
 >>This is what happens (I guess):

[...]

 > I think it's something else.  The %visited hash gets set before all
 > of that.

Ok, so what I thought was happening was this:

1) Go to http://my.host/(00000)/doc1.htm
2) Populate %visited with http://my.host/(00000)/doc1.htm

3) Use filter_content to change
        http://my.host/(00000)/doc1.htm
    to
        http://my.host/doc1.htm

4) Index the document and keep on spidering

5) When the spider finds http://my.host/(11111)/doc1.htm
    it does not know that this document was already spidered,
    because %visited only holds the (00000) form of the URL.

    So spidering goes on forever...

Did I get that right?
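
To make the scenario above concrete, here is a minimal standalone sketch.
It is not spider.pl code and does not claim to show how spider.pl really
fills %visited; it only illustrates my guess. The helper names
strip_session_token and count_new are made up for this example, and the
URLs are the placeholder ones from the steps above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): drop a "/(token)" path
# segment such as "/(00000)" or "/(A(random))" from a URL string.
sub strip_session_token {
    my ($url) = @_;
    $url =~ s{/\([^/]*\)(?=/)}{};
    return $url;
}

# Count how many URLs a crawler would treat as new, given a function
# that turns a URL into the key used for its "already seen" hash.
sub count_new {
    my ( $key_for, @urls ) = @_;
    my %visited;
    my $new = 0;
    for my $url (@urls) {
        my $key = $key_for->($url);
        next if $visited{$key}++;    # seen before, skip it
        $new++;
    }
    return $new;
}

# The same document reached through three different random strings.
my @found = (
    'http://my.host/(00000)/doc1.htm',
    'http://my.host/(11111)/doc1.htm',
    'http://my.host/(A(xyz))/doc1.htm',
);

my $raw        = count_new( sub { $_[0] }, @found );
my $normalized = count_new( \&strip_session_token, @found );

print "keyed on raw URL:        $raw new documents\n";
print "keyed on normalized URL: $normalized new document\n";
```

If the "already seen" check is keyed on the raw URL, every new random
string looks like a new document; if the URL is normalized before the
check, all three collapse into one entry.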


 > This worked fine for me.  It still spiders the same number of
 > documents:
 >
 > @servers = (
 >     {
 >         base_url => 'http://localhost/apache/index.html',
 >         use_default_config => 1,
 >         filter_content => sub {
 >             my ( $uri, $response, $server, $content_ref ) = @_;
 >             $uri->path('hello');
 >             return 1;
 >         },
 >     }
 > );

It works, but does your server keep generating new "bogus"
links containing a random string, like
http://my.host/(random_string)/doc1.htm?

A.



Received on Thu May 19 07:50:01 2005