On Thu, May 19, 2005 at 04:23:36AM -0700, koszalekopalek wrote:
> Hello,
>
> The website that I'm trying to index uses two URL schemes:
>
> 1) if the browser/agent accepts cookies, "regular" urls are used,
> for example: http://my.host/some/path
You should be able to set

    use_cookies => 1,

in the spider config to enable its cookie jar.
> 2) if cookies are rejected a random string is inserted into the
> URL, e.g.
> http://my.host/(A(A_random_string_inserted_if_cookies_rejected))/some/path
> The random strings change during the session.
>
> The second URL scheme (i.e. the one with the random_string) is used
> for spider.pl. I want to re-write the URLs so that swish-e returns
> "regular" URLs.
>
> I tried two callback functions for spider.pl (test_url and
> filter_content) but both did not work:
test_url won't work for this because it runs before the request is
sent to the server, so rewriting the URL there changes what is
requested from the server.
> sub my_filter_content {
>     my $path = $uri->path;
>     # remove random string from $path
>     $path =~ s{/\(A\(.*?\)\)}{};
>     $uri->path ($path);
>     return 1;
> }
>
> URLs are correctly re-written but the spider never stops spidering.
> This is what happens (I guess):
>
> a) The spider reads, say:
> http://my.host/(A(A000))/doc1.htm
> b) Feeds it to swish-e for indexing as:
> http://my.host/doc1.htm
> c) The spider enters the "funny" URL
> http://my.host/(A(A000))/doc1.htm into the %visited hash. So the next
> time when it comes across the same URL (but with a modified random
> string, e.g. http://my.host/(A(A333))/doc1.htm) that url is not
> considered visited. The spider reads the page again and sends it
> to swish-e for (re-)indexing as http://my.host/doc1.htm
I think it's something else. The %visited hash gets set before any
of that happens.

This worked fine for me. It still spiders the same number of
documents:
    @servers = (
        {
            base_url           => 'http://localhost/apache/index.html',
            use_default_config => 1,
            filter_content     => sub {
                my ( $uri, $response, $server, $content_ref ) = @_;
                $uri->path('hello');
                return 1;
            },
        }
    );
Try that on your own small "web site" -- create three or four linked
pages and watch what happens.
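
For a test closer to the real site, here is a rough sketch that combines the substitution from the quoted message with the config above. The helper name strip_session_segment and the base_url value are made up for illustration, and `our @servers` assumes the config file is loaded into the main package. Treat it as a starting point, not a verified fix:

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): remove the
# "/(A(...))" segment from a URL path, using the regex from the
# quoted message.
sub strip_session_segment {
    my ($path) = @_;
    $path =~ s{/\(A\(.*?\)\)}{};
    return $path;
}

# Assumed spider config entry; base_url is a placeholder.
our @servers = (
    {
        base_url           => 'http://my.host/',
        use_default_config => 1,
        filter_content     => sub {
            my ( $uri, $response, $server, $content_ref ) = @_;
            # Rewrite only the path that is passed on for indexing.
            $uri->path( strip_session_segment( $uri->path ) );
            return 1;    # true: do not skip this document
        },
    },
);
```

Keeping the substitution in its own sub makes it easy to check against a few sample paths before running the spider.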
--
Bill Moseley
moseley@hank.org
Received on Thu May 19 07:14:38 2005