Hello,
The website that I'm trying to index uses two URL schemes:
1) If the browser/agent accepts cookies, "regular" URLs are used,
   for example: http://my.host/some/path
2) If cookies are rejected, a random string is inserted into the
   URL, e.g.
   http://my.host/(A(A_random_string_inserted_if_cookies_rejected))/some/path
The random string changes during the session.
spider.pl gets the second URL scheme (i.e. the one with the
random string). I want to rewrite the URLs so that swish-e returns
"regular" URLs.
I tried two callback functions for spider.pl (test_url and
filter_content), but neither worked:
1)
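sub my_filter_content {
    # spider.pl passes these arguments to the filter_content callback
    my ( $uri, $server, $response, $content_ref ) = @_;
    my $path = $uri->path;
    # remove random string from $path
    $path =~ s{/\(A\(.*?\)\)}{};
    $uri->path($path);
    return 1;
}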
URLs are correctly re-written but the spider never stops spidering.
This is what happens (I guess):
a) The spider reads, say:
http://my.host/(A(A000))/doc1.htm
b) Feeds it to swish-e for indexing as:
http://my.host/doc1.htm
c) The spider enters the "funny" URL
http://my.host/(A(A000))/doc1.htm into the %visited hash. So the next
time when it comes across the same URL (but with a modified random
string, e.g. http://my.host/(A(A333))/doc1.htm) that url is not
considered visited. The spider reads the page again and sends it
to swish-e for (re-)indexing as http://my.host/doc1.htm
This goes on forever.
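To illustrate what I think is going on, here is a small standalone sketch. It is not spider.pl's actual code: strip_token and count_fetches are names I made up, and the local %visited hash only stands in for the spider's bookkeeping, which may work differently. It counts how many fetches happen when the hash is keyed on the raw URL versus the stripped URL.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the "/(A(...))" session segment from a URL string.
sub strip_token {
    my ($url) = @_;
    $url =~ s{/\(A\(.*?\)\)}{};
    return $url;
}

# Stand-in for the spider's bookkeeping (an assumption, not spider.pl's code).
# Returns how many of the given URLs would be fetched.
sub count_fetches {
    my ( $normalize, @urls ) = @_;
    my %visited;
    my $fetches = 0;
    for my $url (@urls) {
        my $key = $normalize ? strip_token($url) : $url;
        next if $visited{$key}++;    # already seen under this key
        $fetches++;
    }
    return $fetches;
}

# The same document, reached with two different random strings.
my @seen = (
    'http://my.host/(A(A000))/doc1.htm',
    'http://my.host/(A(A333))/doc1.htm',
);

print count_fetches( 0, @seen ), "\n";    # keyed on the raw URL: prints 2
print count_fetches( 1, @seen ), "\n";    # keyed on the stripped URL: prints 1
```

If the spider really keys %visited on the raw URL, every new random string looks like a new page, which would explain the endless loop.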
2)
sub my_test_url {
    # ...
    my $path = $uri->path;
    # remove random string from $path
    $path =~ s{/\(A\(.*?\)\)}{};
    $uri->path($path);
    return 1;
}
No page gets indexed. I suppose this is what happens:
a) The spider goes to http://my.host and enters
http://my.host into the %visited hash.
b) At http://my.host it gets redirected to
http://my.host/(A(A_random_string))/.
c) Subroutine my_test_url rewrites the URL, but the rewritten
URL is seen as already visited.
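Again as a standalone sketch of my guess, not spider.pl's actual code: strip_token, %visited and @indexed are made up for the demonstration, and I assume the redirect target has the same /(A(...)) form as the other URLs.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the "/(A(...))" session segment from a URL string.
sub strip_token {
    my ($url) = @_;
    $url =~ s{/\(A\(.*?\)\)}{};
    return $url;
}

my %visited;    # stand-in for the spider's bookkeeping (assumption)
my @indexed;    # pages that would be passed on to swish-e

# a) The start URL is recorded as visited before it is requested.
my $start = 'http://my.host/';
$visited{$start} = 1;

# b) The server answers with a redirect to the URL with the random string.
my $redirect = 'http://my.host/(A(A_random_string))/';

# c) The test_url-style rewrite turns the redirect target back into the start URL.
my $rewritten = strip_token($redirect);

if ( $visited{$rewritten} ) {
    print "skipped (already visited): $rewritten\n";
}
else {
    push @indexed, $rewritten;
}

print scalar(@indexed), " page(s) indexed\n";    # prints "0 page(s) indexed"
```

The rewritten redirect target equals the start URL, so it is dropped and nothing is ever fetched.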
Is there a way to fix the URLs in the callback functions
without hacking spider.pl?
Thanks,
A.
Received on Thu May 19 04:25:25 2005