Re: Behavior of max_depth in spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jan 12 2007 - 14:56:10 GMT
On Fri, Jan 12, 2007 at 06:28:24AM -0800, andy rosbrook wrote:
> Hello all,
> 
> I am curious on how the max_depth setting works in spider.pl and sub 
> domains. For example if i index the url www.somesite.com/sub/ and set the 
> max_depth to 2 will the spider stay within the sub folder for links or will 
> it look inside somesite.com?

max_depth isn't what you probably think it is.

IIRC, depth is a measure of how far a link is from the top-level
page where you started the spider.  That is, how many "clicks" it took
to get to the current page from the top page.

Obviously, you can often get to a given page by different click paths.
So the same page can have different depths depending on how the spider
finds the page.
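
To illustrate how the recorded depth depends on the route taken, here is
a toy depth-first crawler over a made-up link graph.  This is only a
sketch and not spider.pl's actual code: the %links table, the crawl()
function, and the choice to count the start page as depth 0 are all
assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical site: each page maps to the pages it links to.
# Note that /c is one click from "/" but also reachable via /a -> /b.
my %links = (
    '/'  => [ '/a', '/c' ],
    '/a' => [ '/b' ],
    '/b' => [ '/c' ],
    '/c' => [],
);

# Follow links depth-first and record the click count at which each
# page is first reached.  Pages deeper than $max_depth are skipped.
sub crawl {
    my ( $page, $depth, $max_depth, $seen ) = @_;
    return if $depth > $max_depth or exists $seen->{$page};
    $seen->{$page} = $depth;
    crawl( $_, $depth + 1, $max_depth, $seen ) for @{ $links{$page} };
    return $seen;
}

# With a limit of 3, /c is first found through /a -> /b (three clicks).
printf "max_depth 3: /c at depth %d\n", crawl( '/', 0, 3, {} )->{'/c'};   # 3

# With a limit of 2, that route is cut off, so /c is found directly
# from the start page (one click).
printf "max_depth 2: /c at depth %d\n", crawl( '/', 0, 2, {} )->{'/c'};   # 1
```

The same page ends up with a different depth in each run, which is why
click depth cannot be read off the URL alone.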

It's not a measurement of, say, how many path segments a file is from
the root.  That's trivial to measure and to test in a "test_url"
function (just split the path on "/" and count).
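
For comparison, a path-segment limit could be sketched in "test_url"
roughly like this.  The path_segments() helper and the
$MAX_PATH_SEGMENTS limit are hypothetical names for the example; the
only thing taken from the spider is that "test_url" receives a URI
object with a path method.

```perl
use strict;
use warnings;

# Hypothetical limit, for illustration only.
my $MAX_PATH_SEGMENTS = 3;

# Count the non-empty segments in a URL path:
# "/sub/docs/page.html" has 3, "/sub/" has 1, "/" has 0.
sub path_segments {
    my ($path) = @_;
    my @segments = grep { length $_ } split m{/}, $path;
    return scalar @segments;
}

sub test_url {
    my ( $uri, $server ) = @_;

    # Reject anything nested deeper than the limit.
    return path_segments( $uri->path ) <= $MAX_PATH_SEGMENTS ? 1 : 0;
}
```

Unlike click depth, this depends only on the URL itself, so it gives the
same answer no matter how the spider reached the page.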

max_depth is only there because click depth is not something that can
be counted outside of the spider (i.e. in your config).

I think the docs on max_depth discuss this -- yes, slightly, even with
the misspellings.


> I've done a few tests and it seems to go back up into root folders at 
> certain times, i assume when it needs more links? Can anyone explain how it 
> traverses the pages and if it is possible to limit the spider to only take 
> links from the sub domain?

The only built-in limit the spider has is to stay within the domain.
If you start at www.somesite.com/sub/ the spider will follow links to
the root if they exist.  If you want it to always stay within /sub/
then test that in "test_url".  There's an example of this in the
sample spider config "SwishSpiderConfig.pl" included in the
distribution:

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;

    # ignore any common image files (extension must be present)
    return if $uri->path =~ /\.(gif|jpg|jpeg|png)$/;

    # make sure that the path is limited to the docs path
    return $uri->path =~ m[^/current/docs/];
}
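
Adapted to the /sub/ case from the question, a minimal sketch might look
like the following.  The /sub/ prefix comes from the example URL above;
the case-insensitive image check is my own variation, not what the
sample config ships with.

```perl
use strict;
use warnings;

sub test_url {
    my ( $uri, $server ) = @_;

    # Skip common image files, matching the extension case-insensitively.
    return 0 if $uri->path =~ /\.(?:gif|jpe?g|png)$/i;

    # Only index and follow URLs whose path starts with /sub/.
    return $uri->path =~ m{^/sub/} ? 1 : 0;
}
```

With this in place, links that lead back up to the root are still seen
by the spider but are rejected before they are fetched.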

-- 
Bill Moseley
moseley@hank.org

Received on Fri Jan 12 06:56:11 2007