
Re: Max depth error??

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue May 24 2005 - 13:26:28 GMT
On Tue, May 24, 2005 at 01:29:09AM -0700, Juan Salvador Castejón wrote:
> 
> I'm indexing a web site using spider.pl on a Windows XP machine. The
> problem is that swish-e does not index files whose depth is >= 5. The
> spider is crawling all the pages correctly and supplying them all to
> swish-e, but swish-e completely ignores all the pages whose depth is
> >= 5.

max_depth is a count of recursion, not of the number of path segments.
It's really a count of a page's ancestors.  It was added mostly to
protect against dynamically generated links that continue forever (such
as a changing session id in a link).  I'm not sure how valuable it is.

> 	max_depth	=> 10,

So you could still have a page:

    http://localhost/some/path/index.html

that is 10 "deep" because it took 10 pages of links to find it.

    index.html has a link to index1.html.  index1.html has a link to
    index2.html.  index2.html has a link to index3.html.  [...]
    index9.html has a link to http://localhost/some/path/index.html.

(might have an off-by-one error there...)
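
For reference, max_depth just goes in the server hash in your spider
config, something like this (an untested sketch -- the base_url and
email are made up, so adjust for your site):

    # SwishSpiderConfig.pl -- minimal sketch (untested)
    @servers = (
        {
            base_url  => 'http://localhost/index.html',  # hypothetical start page
            email     => 'you@example.com',              # hypothetical contact address
            # Stop following links more than 10 link-hops from base_url.
            # Again: this is recursion depth, not path segments.
            max_depth => 10,
        },
    );
    1;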



If you want to limit by the number of path segments instead, count them
in test_url.

Something like this (untested):

    test_url    => sub {
        my ($uri) = @_;
        # Split into an array first -- calling split in scalar
        # context is not a good idea.
        my @segments = split(m!/!, $uri->path);
        # @segments will have an extra (empty) value due to the
        # leading slash, so the count is segments + 1.
        # Return true to spider the URL, false to skip it:
        return @segments <= 10;
    },
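
If you want to sanity-check the counting outside the spider, a tiny
throwaway script like this (also untested, URLs made up) shows what
the split gives you:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;

    # Print the segment count for a couple of made-up URLs and
    # whether the test_url above would spider or skip them.
    for my $url ( 'http://localhost/some/path/index.html',
                  'http://localhost/a/b/c/d/e/f/g/h/i/j/k.html' ) {
        my @segments = split m!/!, URI->new($url)->path;
        printf "%-45s %2d -> %s\n", $url, scalar @segments,
            @segments <= 10 ? 'spider' : 'skip';
    }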

-- 
Bill Moseley
moseley@hank.org
