Bill Moseley wrote:
>
> At 10:03 AM 04/22/02 -0700, Linda DeBoer wrote:
> > Whenever I run swish-e against a site which has a url pointing back
> >to the home page, it loops.
>
> You don't mean "loop" in that it indexes the same URL more than once, right?
>
It might, if there is an equivalent URL not configured with the
EquivalentServer directive. For example, http://www.sacto.com/ and http://sacto.com/
are two URLs for the same page. So wouldn't you need something like this in your config file?
EquivalentServer http://sacto.com http://www.sacto.com
Or, if the links back to the home page are not consistent, you might
also wind up with variants like these being indexed separately:
http://sacto/
http://sacto.com/index.htm
http://www.sacto.com/index.htm
And then there are the case variants, which all resolve to the same page if the host is MS-based:
http://www.sacto.com/Index.htm
http://www.sacto.com/INDEX.htm
http://www.sacto.com/INDEX.HTM
> But, if you are using 2.1-dev, and the -S prog method with spider.pl then
> it's rather easy to do this.
>
> In the config you can say:
>
> test_url => sub {
> my $uri = shift;
> return $uri->path =~ m!^/some/path!;
> }
>
I do this. Just like Bill says, it works like a charm. :-)
If you want to see how I use this, you can check the
"spider configuration template" link from
http://www.arb.ca.gov/db/search/swishe/swishe.htm
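
For the index.htm and upper/lower-case variants listed earlier, the same test_url callback could probably fold them into one URL before the spider decides whether it has seen the page. What follows is only a rough sketch, not my actual config: the host name comes from the examples above, the email address is a placeholder, and I am assuming that spider.pl honours changes made to the URI object inside test_url and that /index.htm really is the directory index on that server.

```perl
# Hypothetical spider.pl config fragment (a sketch, not tested against 2.1-dev).
@servers = (
    {
        base_url => 'http://www.sacto.com/',
        email    => 'webmaster@example.com',   # placeholder address

        test_url => sub {
            my $uri = shift;                   # URI object for the link being considered

            # Assumption: the host treats paths case-insensitively (MS-based),
            # so /Index.htm, /INDEX.htm and /INDEX.HTM become one spelling.
            my $path = lc $uri->path;

            # Assumption: index.htm is the directory index, so map it to "/".
            $path =~ s{/index\.htm$}{/};

            $uri->path($path);                 # rewrite the link in place
            return 1;                          # true means "go ahead and spider it"
        },
    },
);

1;
```

This does nothing for the host name variants (sacto.com vs. www.sacto.com); those would still need to be declared equivalent separately.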
> Another option, which would be fast, would be to run another web
> server/virtual host on a different port, and change the document root.
>
Interesting. Then you'd use the ReplaceRules directive to
rewrite the URL as it goes into the index?
Gerald
Received on Mon Apr 22 17:50:56 2002