Re: scope of indexing with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Oct 29 2002 - 21:44:13 GMT
At 01:11 PM 10/29/02 -0800, Shen Yang wrote:
>Now that I am ready to index my site, a question occurred to me: how 
>does spider.pl know when to stop crawling? Does the spider only index 
>pages of a given server and/or domain, or does spider.pl follow all the 
>links that it encounters, including links to sites on other servers 
>and/or domains?

It spiders one server, which is defined by a host name and a port number.

>For instance, if my site in the domain ny.frb.org has 
>links to pages on www.firstgov.org, does that mean that spider.pl 
>will also index pages in first.gov domain?

No.

The configuration file is a Perl array, with each element of the array
being a separate server config (represented by a Perl hash).  This allows
you to index multiple servers.  See:

   http://swish-e.org/dev/docs/spider.html#CONFIGURATION_FILE
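
As a rough sketch of that layout, a config file covering two servers might
look something like the following. The host names, port, and email address
are placeholders, and the exact set of required keys should be checked
against the documentation linked above:

```perl
# Hypothetical two-server config; all values below are placeholders.
my %main_site = (
    base_url => 'http://www.example.org/',
    email    => 'you@example.org',
);

# A different host name (or port) counts as a separate server,
# so it gets its own hash.
my %docs_site = (
    base_url => 'http://docs.example.org:8080/',
    email    => 'you@example.org',
);

# The spider reads the @servers array; each element is a reference
# to one server's config hash.
@servers = ( \%main_site, \%docs_site );

1;
```

Each hash is spidered on its own, so links from one server to the other
are not followed unless the target is also listed as a server.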

For a given server, you can use the "same_hosts" setting to say that
www.frb.org and frb.org are the same server.
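
For example, a minimal sketch of a single-server config that treats both
names as one site (the email address is a placeholder, and the other keys
you need may differ):

```perl
@servers = (
    {
        base_url   => 'http://www.frb.org/',
        # Links to these hosts are treated as belonging to base_url's server.
        same_hosts => [ qw( frb.org ) ],
        email      => 'you@example.org',   # placeholder
    },
);

1;
```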

There's currently no way to tell the spider to index www.frb.org and
also follow links from www.frb.org to a list of other servers.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Oct 29 21:48:01 2002