Re: URL-fixing with callback routines for

From: Bill Moseley <moseley(at)>
Date: Thu May 19 2005 - 15:22:40 GMT
On Thu, May 19, 2005 at 04:51:58PM +0200, koszalekopalek wrote:
> 1) Go to
> 2) Populate %visited with
> 3) Use filter_content to change
>    to

That just changes the output, right?

> 4) Index the document and keep on spidering
> 5) When the spider finds
>    it does not know that this URL was already spidered.
>    So spidering goes on for ever...

Yes, that's true -- it's a different URL.

If I remember correctly, the %visited hash gets set when extracting
links, so it's not easy to do what you are trying.  If your server
continues to give new URLs to follow then the spider will follow
those.  (You might try changing $uri->path in a test_response
callback, but I don't think that will work.)

So, just maintain your own %seen hash in the config file.  Normalize
the URL and add it to %seen in test_url -- and return 0 if
$seen{$url} already exists.

That's why the config file is Perl code rather than plain text -- so
you can do things like that.
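
The %seen approach above might look roughly like this in the spider
config file.  This is an untested sketch: normalize_url() is a
hypothetical helper, and its rule (drop the query string and fragment)
is only a stand-in for whatever makes two URLs "the same" on your site.
The base_url and email values are placeholders.  It assumes test_url
is passed a URI object as its first argument.

```perl
# Spider config sketch -- adjust the normalization to your own URLs.

my %seen;    # normalized URLs that test_url has already accepted

# Hypothetical helper: reduce a URL to the form used as the %seen key.
sub normalize_url {
    my ($uri) = @_;
    my $copy = $uri->clone;      # leave the spider's own URI object alone
    $copy->query(undef);         # assumption: the query string is noise
    $copy->fragment(undef);
    return $copy->canonical->as_string;
}

@servers = (
    {
        base_url => 'http://localhost/',       # placeholder
        email    => 'webmaster@localhost',     # placeholder

        test_url => sub {
            my ( $uri, $server ) = @_;
            my $key = normalize_url($uri);
            return 0 if $seen{$key}++;   # already seen in normalized form: skip
            return 1;                    # first time: let the spider fetch it
        },
    },
);

1;
```

Since only the first URL that maps to a given key is accepted, later
variants of the same page are rejected before they are fetched, which
is what stops the endless spidering.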

Bill Moseley

Received on Thu May 19 08:22:47 2005