

Re: Spider, but not index?

From: David Wood <dwood(at)not-real.inter.nl.net>
Date: Wed Jun 23 2004 - 14:04:39 GMT
In your spider config file, put something like this:

@servers = (

     {
         ...
         test_response => \&test_response,
         ...
     }

);



sub test_response {

     # These URLs should be spidered (their links followed), but not
     # indexed, as their content is too generic.
     my @SNUBBED_URLS = (
         "/index/index.htm",
         "/mainpage.new/pwebbrief.html",
         "/mainpage.new/pweb_faq.htm",
         "/products/products_nojs.htm",
         "/sitemap/map.htm",
         "/toolkit/salesmkt_toolkit.htm",
     );

     my ( $uri, $server ) = @_;

     foreach my $url (@SNUBBED_URLS) {
         # \Q...\E quotes the dots and slashes so they match literally.
         if ( $uri->path =~ /\Q$url\E$/ ) {
             $server->{no_index} = 1;
             last;
         }
     }

     return 1;

}
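In case it helps to see how the callback plugs in: spider.pl is run as a "prog" input source, with the spider config file passed as a parameter. The file names below (swish.conf, SwishSpiderConfig.pl) are just the conventional examples from the spider.pl docs; substitute your own.

```
# swish.conf -- index via the spider program
IndexDir             spider.pl
SwishProgParameters  SwishSpiderConfig.pl

# SwishSpiderConfig.pl is the file holding the @servers array
# and the test_response sub shown above.  Then run:
#
#   swish-e -S prog -c swish.conf
```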



Cheers,

David






At 15:40 Wednesday 23-6-2004, David VanHook wrote:

>Is there a relatively easy way to get SWISH-E to spider a page (i.e., to
>follow all of the links on it), but to not index the contents of that same
>page?  I've tried using FileRules title in the config file, but am having no
>luck -- I get a Bad Directive error, even when I paste in the code directly
>from the online docs.
>
>Thanks!
>
>Dave VanHook
>dvanhook@mshanken.com



Received on Wed Jun 23 14:04:41 2004