Re: Indexing Off Site Links

From: Thomas Dowling <tdowling(at)not-real.ohiolink.edu>
Date: Thu Sep 16 2004 - 19:39:53 GMT

Antonio Barrera wrote:

>I've seen some threads about problems similar to the one I'm facing, but
>many of the solutions were old.
>
>My base URL is http://library.princeton.edu. However, it links to other
>servers that I want to index in part, without indexing each entire site.
>I know before indexing which servers and directories I want to search.
>
>For instance, I may want to index
>http://www.princeton.edu/~rbsc/exhibitions/online.html but not all of
>www.princeton.edu. Or I may want to index
>http://libweb5.princeton.edu/ejournals/by_title_zd.asp but not all of
>libweb5.princeton.edu.

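Somewhere along the way, I picked up this syntax for the spider.conf file.
Define a second server hash whose test_url callback rejects any URL outside
the directory you want, then add that hash to @servers: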

=============
my %SecondarySite = (
  base_url      => 'http://foo.ohiolink.edu/documentation/',
  email         => 'tdowling@ohiolink.edu',
  delay_sec     => 1,

  test_url      => sub {
    my $uri = shift;

    # Skip requesting files that are probably not text
    return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

    # Limit spidering by path
    # We only want the /documentation/ directory, so anchor the match
    # at the start of the path
    return unless $uri->path =~ m{^/documentation/};

    return 1;  # otherwise, ok to search
  },

);

# %MainSite is the primary server hash, defined earlier in the same file
@servers = (\%MainSite, \%SecondarySite);
=============
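
Applied to the URLs in your question, the same pattern might look roughly
like the sketch below. I have not run this against your servers. The hash
names (%MainSite, %Exhibitions, %EJournals), the placeholder email address,
and the path prefixes are assumptions based on the two example URLs, so
adjust them to the directories you actually want.

=============
# Hypothetical primary site; replace with your existing main server hash
my %MainSite = (
  base_url      => 'http://library.princeton.edu/',
  email         => 'you@example.com',    # placeholder address
  delay_sec     => 1,
);

# Assumed: only the /~rbsc/exhibitions/ directory on www.princeton.edu
my %Exhibitions = (
  base_url      => 'http://www.princeton.edu/~rbsc/exhibitions/online.html',
  email         => 'you@example.com',    # placeholder address
  delay_sec     => 1,

  test_url      => sub {
    my $uri = shift;

    # Skip requesting files that are probably not text
    return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

    # Stay inside the exhibitions directory; links may write
    # the tilde either literally or as %7E
    return unless $uri->path =~ m{^/(?:~|%7E)rbsc/exhibitions/}i;

    return 1;  # otherwise, ok to spider
  },
);

# Assumed: only the /ejournals/ directory on libweb5.princeton.edu
my %EJournals = (
  base_url      => 'http://libweb5.princeton.edu/ejournals/by_title_zd.asp',
  email         => 'you@example.com',    # placeholder address
  delay_sec     => 1,

  test_url      => sub {
    my $uri = shift;

    # Stay inside the ejournals directory
    return unless $uri->path =~ m{^/ejournals/};

    return 1;  # otherwise, ok to spider
  },
);

@servers = (\%MainSite, \%Exhibitions, \%EJournals);
=============

Each hash is spidered separately from its own base_url, so a path test in
one hash does not affect the others.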


--
Thomas Dowling
tdowling@ohiolink.edu
Received on Thu Sep 16 12:40:16 2004