Re: swish-e only spiders the server it started on

From: Cas Tuyn <cas.tuyn(at)not-real.gmail.com>
Date: Mon May 15 2006 - 09:23:50 GMT
Hi,

The command to start the indexing is:
          swish-e -c swish-e.conf -S prog

The config file is:

@servers = ({
   skip        => 0,  # skip spidering server flag
   base_url    => 'http://aaa.company.com/intranet/index.html',
   credentials => 'not-a:chance',
   agent       => 'swish-e spider http://swish-e.org/',
   email       => 'Bogus@company.com',

   # limit to only .html files
   delay_sec   => 0,        # Delay in seconds between requests
   max_time    => 60,       # Max time to spider in minutes
   max_files   => 10000,    # Max Unique URLs to spider
   max_indexed => 10000,    # Max number of files to send to swish for indexing
   max_depth   => 3,        # the max number of layers to spider
   keep_alive  => 1,        # enable keep-alive requests
   use_cookies => 1,        # True will keep cookie jar
   validate_links => 1,     # Solution to the single webserver?
});
1;

Note that max_depth is set low for now so that indexing does not take
too long, but the link across servers occurs within these 3 levels.
The link should also be found well within the max_files and
max_indexed limits.

Cas


On 5/11/06, Peter Karman <peter@peknet.com> wrote:
> Since you didn't post your config files, we have no way of knowing if
> the problem is there.
>
> Chances are good that you need to list all 3 base URLs in your config
> file, since the spider likely sees them as different hosts and doesn't
> follow them. If it did follow, by default, a link to http://google..../
> could prove disastrous. ;)
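
For illustration, here is a minimal sketch of what listing more than one
base URL in the spider config might look like. The hosts
bbb.company.com and ccc.company.com and their paths are hypothetical
placeholders, not taken from the original setup; only aaa.company.com
comes from the config above. Each entry in @servers is spidered as its
own site, so the per-server keys are repeated in every entry.

```perl
# Sketch only: one hash per host that should be spidered.
# bbb.company.com and ccc.company.com are made-up examples.
@servers = (
    {
        base_url  => 'http://aaa.company.com/intranet/index.html',
        agent     => 'swish-e spider http://swish-e.org/',
        email     => 'Bogus@company.com',
        max_depth => 3,    # same depth limit as in the config above
    },
    {
        base_url  => 'http://bbb.company.com/',    # hypothetical host
        agent     => 'swish-e spider http://swish-e.org/',
        email     => 'Bogus@company.com',
        max_depth => 3,
    },
    {
        base_url  => 'http://ccc.company.com/',    # hypothetical host
        agent     => 'swish-e spider http://swish-e.org/',
        email     => 'Bogus@company.com',
        max_depth => 3,
    },
);
1;
```

The other keys from the original config (credentials, delay_sec,
max_files and so on) would presumably be copied into each entry as
needed.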
Received on Mon May 15 02:23:59 2006