Re: Behavior of max_depth in

From: andy rosbrook <andy_rosbrook(at)>
Date: Fri Jan 12 2007 - 15:53:52 GMT
Ahh, I understand it now. I'll have a play about with that callback! Thanks,
Bill and Cas :)


>From: Bill Moseley <>
>To: Multiple recipients of list <>
>Subject: [SWISH-E] Re: Behavior of max_depth in
>Date: Fri, 12 Jan 2007 06:56:03 -0800 (PST)
>On Fri, Jan 12, 2007 at 06:28:24AM -0800, andy rosbrook wrote:
> > Hello all,
> >
> > I am curious about how the max_depth setting works in and sub
> > domains. For example, if I index the url and set
> > max_depth to 2, will the spider stay within the sub folder for links or
> > will it look inside
>max_depth isn't what you probably think it is.
>IIRC, depth is a measurement of how far a link is from the top-level
>page where you started the spider.  That is, how many "clicks" it took
>to get to the current page from the top page.
>Obviously, you can often get to a given page by different click paths.
>So the same page can have different depths depending on how the spider
>finds the page.
>It's not a measurement of, say, how many path segments a file is from
>the root.  That's trivial to measure and to test in a "test_url"
>function (just split the path on "/" and count).
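The "split the path and count" test might look something like the sketch below. The path_depth helper and the $MAX_PATH_DEPTH limit are hypothetical names introduced for illustration, not spider settings; the only thing assumed about $uri is that it has a path method, as in the sample config quoted further down.

```perl
use strict;
use warnings;

# Hypothetical limit on path segments (an assumption, not a spider option).
my $MAX_PATH_DEPTH = 3;

# Count the non-empty segments of a URL path.
# "/current/docs/index.html" has three: current, docs, index.html.
sub path_depth {
    my ($path) = @_;
    return scalar grep { length } split m{/}, $path;
}

# Callback in the style of the sample config: true means
# "ok to index/spider", false means "skip this URL".
sub test_url {
    my ( $uri, $server ) = @_;
    return path_depth( $uri->path ) <= $MAX_PATH_DEPTH;
}
```

Unlike max_depth, this check depends only on the URL itself, so the same page always gets the same answer no matter which links led to it.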
>max_depth is just there because it's not something that can be counted
>outside of the spider (i.e. in your config).
>I think the docs on max_depth discuss this -- yes, slightly, even with
>the misspellings.
> > I've done a few tests and it seems to go back up into root folders at
> > certain times, I assume when it needs more links? Can anyone explain how 
> > traverses the pages and if it is possible to limit the spider to only 
> > links from the sub domain?
>The only built-in limit the spider has is to stay within the domain.
>If you start at the spider will follow links to
>the root if they exist.  If you want it to always stay within /sub/
>then test that in "test_url".  There's an example of this in the
>sample spider config "" included in the
>sub test_url {
>     my ( $uri, $server ) = @_;
>     # return 1;  # Ok to index/spider
>     # return 0;  # No, don't index or spider;
>     # ignore any common image files
>     return if $uri->path =~ /\.(gif|jpg|jpeg|png)$/;
>     # make sure that the path is limited to the docs path
>     return $uri->path =~ m[^/current/docs/];
>}
>Bill Moseley

Received on Fri Jan 12 07:53:54 2007