

Re: Limiting indexing

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jun 30 2002 - 06:00:36 GMT
At 10:51 PM 06/29/02 -0700, Sutherland, Paul wrote:
>I want to only index certain directories on a server.
>
>e.g. http://something.com/foo & http://something.com/bar but not the rest of
>the site

This is reasonably easy with 2.1-dev and the spider.pl program if you know a
little Perl.

When using the spider.pl program you can define a callback function that
gets called for every URL extracted from a page.  If the callback
function returns false, that URL will not be added to the list of URLs
to spider.

From the top-level directory of the swish-e distribution, run:

   perldoc prog-bin/spider.pl

and search for "test_url".
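The config would look something like the following.  This is only an
untested sketch: the host name and the /foo and /bar paths are placeholders
for your own site, and you should check the test_url docs mentioned above
for the details.

   # Sketch of a spider.pl config file (e.g. SwishSpiderConfig.pl).
   # The test_url callback receives a URI object for each extracted
   # link; returning false skips that URL.
   @servers = (
       {
           base_url => 'http://something.com/',
           email    => 'admin@something.com',   # placeholder address

           # Only follow URLs under /foo or /bar; skip the rest
           # of the site.
           test_url => sub {
               my $uri = shift;
               return $uri->path =~ m{^/(?:foo|bar)(?:/|$)};
           },
       },
   );
   1;

Then point swish-e at the spider with -S prog and this config file.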

With the old -S http method you might be able to hack up the swishspider
Perl program (located in the src directory) to filter out any links that
you don't want to spider.  That program is the one that fetches each
remote doc and extracts the links from it.



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Sun Jun 30 06:04:04 2002