Re: Indexing Off Site Links

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 16 2004 - 19:47:54 GMT
On Thu, Sep 16, 2004 at 12:04:37PM -0700, Antonio Barrera wrote:
> I've seen some threads about similar problems to the one I'm facing, yet
> many were older solutions.
> 
> My base url is: http://library.princeton.edu .  However, there are links to
> other servers which I would want to index, without indexing the entire site.
> Prior to indexing I have some knowledge of servers/directories, I do want to
> search.  

If you know ahead of time what external sites you wish to index (which
you have to know) you can specify more than one server config in the
spider config file -- each with its own set of parameters.  That's why
the config is stored in an array variable.
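
As a rough sketch of the multi-server approach, a spider config might
look something like the following.  It assumes the usual layout where
the config file fills in @servers.  The host other.example.edu, the
/collections/ path and the email address are made-up placeholders, not
anything from the setup described above.

```perl
# Sketch of a spider config with two server entries.
# Hypothetical: other.example.edu, /collections/, and the email address.

my %main_site = (
    base_url => 'http://library.princeton.edu/',
    email    => 'admin@example.edu',      # placeholder contact address
);

my %external_site = (
    base_url => 'http://other.example.edu/collections/',
    email    => 'admin@example.edu',      # placeholder contact address
    # Only follow links that stay under the directory of interest.
    test_url => sub {
        my ($uri) = @_;                   # URI object for the candidate link
        return $uri->path =~ m{^/collections/};
    },
);

# spider.pl reads the list of servers from @servers.
@servers = ( \%main_site, \%external_site );

1;
```

Each hash is spidered with its own settings, so the limits on the
external server do not affect the main site.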

Otherwise, you could modify the check_link() function in spider.pl.
check_link() allows any host that is also listed in the "same_hosts"
parameter.  That config option is for sites that resolve to more than
one domain name (e.g. www.swish-e.org, swish-e.org, swish-e.com,
www.swish-e.com).  So, you list swish-e.org as the base_url and the
others are listed in same_hosts.  You might use this if your site has
absolute URLs hard-coded in its pages and more than one domain name
pointing to the same place.
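
For the alias case just described, the config entry would be along
these lines.  This is only a sketch: the host names are the ones
mentioned above, and the email address is a placeholder.

```perl
# One site reachable under several names: index it once under base_url.
@servers = (
    {
        base_url   => 'http://swish-e.org/',
        same_hosts => [ qw( www.swish-e.org swish-e.com www.swish-e.com ) ],
        email      => 'admin@example.org',   # placeholder contact address
    },
);

1;
```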

The check_link() function then does this:

   $u->host_port( $server->{authority} );  # Force all the same host name

Which, as the comment says, forces all links to any of those sites to
be to the one specified in the base_url.  This prevents links to:

    www.swish-e.org/index.html
    swish-e.org/index.html

from being indexed twice.  Plus, it makes all search results have the
same host name.
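
To see what that rewrite does in isolation, here is a small stand-alone
sketch using the URI module (which LWP, and so spider.pl, already
depends on).  The helper name force_host() is made up for illustration
and is not part of spider.pl.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: rewrite a link so it uses the given host name,
# the same way the host_port() call in check_link() does.
sub force_host {
    my ( $url, $authority ) = @_;
    my $u = URI->new($url);
    $u->host_port($authority);    # replace host (and port) in place
    return $u->as_string;
}

# Both spellings of the link end up as the same string, so the
# document would only be fetched and indexed once.
my $first  = force_host( 'http://www.swish-e.org/index.html', 'swish-e.org' );
my $second = force_host( 'http://swish-e.org/index.html',     'swish-e.org' );
print "same URL\n" if $first eq $second;
```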

So, if you don't need that feature, comment out that line and add
the extra host names to the "same_hosts" parameter.  Then use a
test_url() callback function in your config to limit which files get
indexed on each of the servers.
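
A sketch of what that config could look like, assuming the
host_port() line has already been commented out so that links keep
their real host name.  The host other.example.edu, the /collections/
prefix, the %allowed table and the email address are all hypothetical.

```perl
# Hypothetical table of hosts to index and the paths allowed on each.
my %allowed = (
    'library.princeton.edu' => qr{^/},               # whole main site
    'other.example.edu'     => qr{^/collections/},   # one directory only
);

@servers = (
    {
        base_url   => 'http://library.princeton.edu/',
        same_hosts => [ 'other.example.edu' ],   # lets check_link() accept it
        email      => 'admin@example.edu',       # placeholder contact address
        test_url   => sub {
            my ($uri) = @_;                      # URI object for the link
            # Skip hosts that are not in the table at all.
            my $prefix = $allowed{ lc $uri->host } or return 0;
            # Otherwise index only paths under the allowed prefix.
            return $uri->path =~ $prefix ? 1 : 0;
        },
    },
);

1;
```

Returning false from test_url skips the link, so anything outside the
listed directories on the extra servers is never fetched.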


That make any sense?


-- 
Bill Moseley
moseley@hank.org

Received on Thu Sep 16 12:48:41 2004