RE: Indexing Off Site Links

From: Antonio Barrera <abarrera(at)not-real.Princeton.EDU>
Date: Thu Sep 16 2004 - 19:43:39 GMT

I think that is precisely what I'm looking for!

From: Thomas Dowling
Sent: Thursday, September 16, 2004 3:37 PM
To: abarrera@Princeton.EDU
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Indexing Off Site Links

Antonio Barrera wrote:

>I've seen some threads about similar problems to the one I'm facing, 
>yet many were older solutions.
>My base url is: .  However, there are 
>links to other servers which I would want to index, without indexing the
>entire site.
>Prior to indexing I have some knowledge of the servers/directories I do 
>want to search.
>For instance:  I may want to index,
> but not all of 
>  Or I may want to do 
> but not all of 

Somewhere along the way, I picked up this syntax in the spider.conf file:

my %SecondarySite = (
  base_url      => '',
  email         => '',
  delay_sec     => 1,

  test_url      => sub {
    my $uri = shift;

    # Skip requesting files that are probably not text
    return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

    # Limit spidering by path
    # We only want the /documentation/ directory
    return unless $uri->path =~ /documentation/;

    return 1;  # otherwise, ok to search
  },
);

# %MainSite is assumed to be defined the same way earlier in the file
@servers = (\%MainSite, \%SecondarySite);
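If there are several off-site directories to pick up, the same idea might be wrapped in a small helper so each extra server gets its own entry. This is only a sketch: the helper name limited_site, the example.org/example.net hosts, and the email placeholder are made up for illustration, and %MainSite is assumed to be defined elsewhere in the same config file. It relies on spider.pl passing a URI object as the first argument to test_url, as in the snippet above.

```perl
# Hypothetical helper: build one server entry whose test_url only
# accepts links whose path starts with $prefix.
sub limited_site {
    my ($base_url, $prefix) = @_;

    return {
        base_url  => $base_url,
        email     => 'you@example.org',   # placeholder address
        delay_sec => 1,

        test_url  => sub {
            my $uri = shift;

            # Skip requesting files that are probably not text
            return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

            # index() returns 0 only when the path begins with $prefix
            return unless index($uri->path, $prefix) == 0;

            return 1;  # otherwise, ok to index
        },
    };
}

@servers = (
    \%MainSite,   # assumed to be defined earlier in this file
    limited_site('http://docs.example.org/documentation/', '/documentation/'),
    limited_site('http://lib.example.net/guides/index.html', '/guides/'),
);
```

Anchoring the match at the start of the path is a deliberate difference from the /documentation/ regex above, which also matches paths such as /old/documentation/.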

Thomas Dowling
Received on Thu Sep 16 12:44:23 2004