Ahh, I understand it now. I'll have a play about with that callback! Thanks
Bill and Cas :)
andy
>From: Bill Moseley <moseley@hank.org>
>Reply-To: moseley@hank.org
>To: Multiple recipients of list <swish-e@sunsite3.berkeley.edu>
>Subject: [SWISH-E] Re: Behavior of max_depth in spider.pl
>Date: Fri, 12 Jan 2007 06:56:03 -0800 (PST)
>
>On Fri, Jan 12, 2007 at 06:28:24AM -0800, andy rosbrook wrote:
> > Hello all,
> >
> > I am curious about how the max_depth setting works in spider.pl and sub
> > domains. For example, if I index the URL www.somesite.com/sub/ and set the
> > max_depth to 2, will the spider stay within the sub folder for links or
> > will it look inside somesite.com?
>
>max_depth isn't what you probably think it is.
>
>IIRC, depth is a measure of how far a link is from the top-level
>page where you started the spider. That is, how many clicks it took
>to get to the current page from the top page.
>
>Obviously, you can often get to a given page by different click paths.
>So the same page can have different depths depending on how the spider
>finds the page.
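
To make the "click depth" idea concrete, here is a minimal sketch in Perl. It is not spider.pl's code: the link graph, the click_depths() helper and the breadth-first order are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical link graph: page => pages it links to.
my %links = (
    '/sub/'       => [ '/sub/a.html', '/' ],
    '/sub/a.html' => ['/sub/b.html'],
    '/'           => [ '/sub/b.html', '/about.html' ],
    '/sub/b.html' => [],
    '/about.html' => [],
);

# Breadth-first walk from $start. Returns a hash ref of page => click depth.
# Links on a page that is already at $max_depth are not followed.
sub click_depths {
    my ( $start, $max_depth, $links ) = @_;
    my %depth = ( $start => 0 );
    my @queue = ($start);
    while ( defined( my $page = shift @queue ) ) {
        next if $depth{$page} >= $max_depth;
        for my $next ( @{ $links->{$page} || [] } ) {
            next if exists $depth{$next};    # already reached by a shorter path
            $depth{$next} = $depth{$page} + 1;
            push @queue, $next;
        }
    }
    return \%depth;
}

my $depths = click_depths( '/sub/', 1, \%links );
print "$_ => $depths->{$_}\n" for sort keys %$depths;
```

With a limit of 1 in this made-up graph, "/" is reached (one click from /sub/) even though it sits outside /sub/, while /sub/b.html is not reached because it is two clicks away. The limit counts clicks, not directories.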
>
>It's not a measurement of, say, how many path segments a file is from
>the root. That's trivial to measure and to test in a "test_url"
>function (just split the path on "/" and count).
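
The "split the path and count" idea might look like the sketch below. The path_segments() helper and the $max_segments limit are hypothetical names; inside a real "test_url" callback the path would come from $uri->path.

```perl
use strict;
use warnings;

# Count the non-empty segments of a URL path,
# e.g. "/sub/docs/a.html" has three.
sub path_segments {
    my ($path) = @_;
    my @segments = grep { length } split m{/}, $path;
    return scalar @segments;
}

# Hypothetical limit on how far below the root a file may be.
my $max_segments = 2;

for my $path ( '/', '/sub/', '/sub/docs/a.html' ) {
    my $n = path_segments($path);
    printf "%-20s %d %s\n", $path, $n, $n <= $max_segments ? 'ok' : 'skip';
}
```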
>
>max_depth is just there because it's not something that can be counted
>outside of the spider (i.e. in your config).
>
>I think the docs on max_depth discuss this -- yes, slightly, even with
>the misspellings.
>
>
> > I've done a few tests and it seems to go back up into root folders at
> > certain times, I assume when it needs more links? Can anyone explain how
> > it traverses the pages and if it is possible to limit the spider to only
> > take links from the sub domain?
>
>The only built-in limit the spider has is to stay within the domain.
>If you start at www.somesite.com/sub/ the spider will follow links to
>the root if they exist. If you want it to always stay within /sub/
>then test that in "test_url". There's an example of this in the
>sample spider config "SwishSpiderConfig.pl" included in the
>distribution:
>
>sub test_url {
> my ( $uri, $server ) = @_;
> # return 1; # Ok to index/spider
> # return 0; # No, don't index or spider;
>
> # ignore any common image files
> return if $uri->path =~ /\.(gif|jpg|jpeg|png)$/;
>
> # make sure that the path is limited to the docs path
> return $uri->path =~ m[^/current/docs/];
>}
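
For the /sub/ case asked about, the same callback could be adapted roughly as follows. This is a sketch, not tested against spider.pl: Demo::URI is a stand-in class invented here so the example runs on its own, and the only thing assumed about the real $uri object is the path() method used in the sample above.

```perl
use strict;
use warnings;

# Stand-in for the URI object the spider passes in (demonstration only).
{
    package Demo::URI;
    sub new  { my ( $class, $path ) = @_; return bless { path => $path }, $class }
    sub path { return $_[0]{path} }
}

# Sketch of a test_url callback that keeps the spider under /sub/.
sub test_url {
    my ( $uri, $server ) = @_;

    # Skip common image files.
    return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)$/i;

    # Only index or spider URLs whose path starts with /sub/.
    return $uri->path =~ m{^/sub/} ? 1 : 0;
}

for my $path ( '/sub/page.html', '/sub/logo.png', '/index.html' ) {
    printf "%-16s %s\n", $path,
        test_url( Demo::URI->new($path) ) ? 'spider' : 'skip';
}
```

A link from /sub/ up to /index.html is rejected by the path test, so the spider never leaves the sub folder regardless of the max_depth setting.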
>
>--
>Bill Moseley
>moseley@hank.org
Received on Fri Jan 12 07:53:54 2007