Re: URL-fixing with callback routines for spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 19 2005 - 15:22:40 GMT
On Thu, May 19, 2005 at 04:51:58PM +0200, koszalekopalek wrote:
> 1) Go to http://my.host/(00000)/doc1.htm
> 2) Populate %visited with http://my.host/(00000)/doc1.htm
> 
> 3) Use filter_content to change
>        http://my.host/(00000)/doc1.htm
>    to
>        http://my.host/doc1.htm

That just changes the output, right?

> 4) Index the document and keep on spidering
> 
> 5) When the spider finds http://my.host/(11111)/doc1.htm
>    it does not know that this URL was already spidered.
> 
>    So spidering goes on for ever...

Yes, that's true -- it's a different URL.

If I remember correctly, the %visited hash gets set when extracting
links, so it's not easy to do what you are trying.  If your server
continues to give new URLs to follow then the spider will follow
those.  (You might try changing $uri->path in a test_response
callback, but I don't think that will work.)

So, just maintain your own %seen hash in the config file.  Normalize
the URL and add it to %seen in test_url -- and return 0 if
$seen{$url} already exists.

That's why the config file is not plain text -- so you can do things
like that.
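
A rough, untested sketch of what such a config file might look like.
It assumes the session token is always a run of digits in parentheses
as its own path segment (as in the /(00000)/ example above), and that
test_url is called with the URI object as its first argument and
should return true to fetch the URL.  normalize_url() is a made-up
helper name, and the email value is only a placeholder.

```perl
# Sketch of a spider.pl config file (plain Perl), not a tested setup.
use strict;
use warnings;

# Our own record of normalized URLs, separate from the spider's %visited.
my %seen;

# Hypothetical helper: drop a "/(digits)/" session segment from a URL.
# Assumption: the token always looks like /(00000)/ in the path.
sub normalize_url {
    my ($url) = @_;
    $url =~ s{/\(\d+\)/}{/};
    return $url;
}

# test_url callback: return true to fetch the URL, false to skip it.
sub test_url {
    my ( $uri, $server ) = @_;
    my $key = normalize_url( $uri->canonical->as_string );

    # Post-increment: false (0) the first time a key is seen, true after.
    return 0 if $seen{$key}++;
    return 1;
}

our @servers = (
    {
        base_url => 'http://my.host/(00000)/doc1.htm',
        email    => 'you@example.com',    # placeholder
        test_url => \&test_url,
    },
);

1;
```

With this in place, http://my.host/(11111)/doc1.htm normalizes to the
same key as the (00000) version and gets skipped.  The rewriting of the
stored URL would still be done separately, e.g. in filter_content as
you describe.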

-- 
Bill Moseley
moseley@hank.org

Received on Thu May 19 08:22:47 2005