
URL-fixing with callback routines for

From: koszalekopalek <koszalekopalek(at)>
Date: Thu May 19 2005 - 11:25:22 GMT

The website that I'm trying to index uses two URL schemes:

1) if the browser/agent accepts cookies, "regular" URLs are used,
    for example:

2) if cookies are rejected, a random string is inserted into the
    URL, e.g.
    The random strings change during the session.

The second URL scheme (i.e. the one with the random_string) is the
one the spider sees. I want to re-write the URLs so that swish-e
returns "regular" URLs.
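The substitution in the callbacks below strips a path segment of the form "/(A(...))". A quick sketch of what it does, using a made-up path (the real URLs were stripped from this archive):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical path -- the real URLs did not survive the archiving.
my $path = '/(A(x1y2z3abc))/products/list.html';

# Same substitution as in the callbacks: drop the "/(A(...))" segment.
$path =~ s{/\(A\(.*?\)\)}{};

print "$path\n";    # prints "/products/list.html"
```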

I tried two callback functions (test_url and filter_content),
but neither worked:

sub my_filter_content {
    # spider.pl passes the URI object first; the exact signature
    # is an assumption here -- check the spider.pl docs for your version
    my ( $uri, $server, $response, $content ) = @_;

    my $path = $uri->path;
    # remove random string from $path
    $path =~ s{/\(A\(.*?\)\)}{};
    $uri->path( $path );
    return 1;
}
URLs are correctly re-written but the spider never stops spidering.
This is what happens (I guess):

a) The spider reads, say:
b) Feeds it to swish-e for indexing as:
c) The spider enters the "funny" URL into the %visited hash. So the next
    time it comes across the same page (but with a different random
    string, e.g. that URL is not
    considered visited. The spider reads the page again and sends it
    to swish-e for (re-)indexing as

This goes on forever.
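The loop hinges on %visited being keyed by the raw URL: two fetches of the same page under different random strings produce two different keys. A small stand-alone demonstration (hypothetical URLs, plain hash standing in for the spider's %visited):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %visited;    # stand-in for spider.pl's %visited hash

# The same page, seen twice with different random strings
# (hypothetical URLs -- the real ones were stripped from the archive).
for my $url ( 'http://example.com/(A(aaa111))/page.html',
              'http://example.com/(A(bbb222))/page.html' ) {
    if ( $visited{$url}++ ) {
        print "already visited: $url\n";
    }
    else {
        print "fetching again: $url\n";    # happens for BOTH URLs
    }
}
```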

sub my_test_url {
    # ...
    my $path = $uri->path;
    $path =~ s{/\(A\(.*?\)\)}{};
    $uri->path( $path );
    return 1;
}

No page gets indexed. I suppose this is what happens:

a) The spider goes to the start URL and enters it into the %visited hash.

b) At it gets redirected to

c) Subroutine my_test_url re-writes the URL, but the re-written
    URL is seen as already visited.

Is there a way to fix the URLs in the callback functions,
without hacking
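One possible workaround, sketched below (not tested against spider.pl; the callback signature and the %visited behaviour are assumptions based on the description above): do the duplicate check yourself in test_url, keyed on the *normalized* URL, but leave $uri untouched so the spider still fetches the real random-string URL. The filter_content rewrite from above then still controls what swish-e indexes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my %seen;    # our own visited-hash, keyed on normalized URLs

# Strip the random-string segment from a *copy* of the URI,
# so the original object the spider uses is not modified.
sub normalized {
    my ($uri) = @_;
    my $copy = $uri->clone;
    ( my $path = $copy->path ) =~ s{/\(A\(.*?\)\)}{};
    $copy->path($path);
    return $copy->as_string;
}

sub my_test_url {
    my ( $uri, $server ) = @_;    # signature assumed from the post
    return 0 if $seen{ normalized($uri) }++;    # page already crawled
    return 1;                                   # new page: crawl it
}
```

Because %seen never contains a random string, the same page seen under two different session tokens hashes to one key, which should break the re-indexing loop regardless of what the spider's own %visited contains.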


Received on Thu May 19 04:25:25 2005