Re: Crawling Sub-domains

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 04 2007 - 18:31:35 GMT
On Thu, Jan 04, 2007 at 10:07:22AM -0800, James wrote:
> I am also wondering if there is a way to get Swish-e's spider to
> automatically follow links to subdomains of the same domain, without having
> it follow off-site links to other domains.  Do you know what I mean?

The spider is just Perl, so it's easy to change. The existing check is:

    # Here we make sure we are looking at a link pointing to the correct (or equivalent) host

    unless ( $server->{scheme} eq $u->scheme && $server->{same_host_lookup}{$u->canonical->authority||''} ) {

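How about something like this, which accepts mydomain.com and any of its
subdomains (the leading "(?:^|\.)" keeps hosts such as notmydomain.com
from matching):

    unless ( $server->{scheme} eq $u->scheme && $u->host =~ /(?:^|\.)mydomain\.com$/i ) {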
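
Here is a rough sketch of the same host test pulled out into a standalone helper so it can be tried outside the spider. The helper name host_in_domain and the sample host names are made up for illustration; they are not part of spider.pl. The match is anchored at the start of the host or at a dot so that only the domain itself and its subdomains pass.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the spider): returns 1 if $host is
# $domain itself or a subdomain of it, 0 otherwise.
sub host_in_domain {
    my ( $host, $domain ) = @_;
    return 0 unless defined $host && length $host;

    # \Q...\E quotes the dots in $domain; (?:^|\.) requires the match to
    # begin at the start of the host or right after a dot.
    return $host =~ /(?:^|\.)\Q$domain\E\z/i ? 1 : 0;
}

# Sample hosts, invented for this example.
for my $host (qw(
    mydomain.com
    www.mydomain.com
    docs.sub.mydomain.com
    notmydomain.com
    mydomain.com.example.org
)) {
    printf "%-26s %s\n", $host,
        host_in_domain( $host, 'mydomain.com' ) ? 'follow' : 'skip';
}
```

Inside the spider the condition would then read something like `host_in_domain( $u->host, 'mydomain.com' )`, assuming $u is the URI object used in the check above.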


You might want to print out what $u->host returns.

Run "perldoc URI" and take a look at methods like:

    $uri->authority
    $uri->host
    $uri->host_port
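
To see how these accessors differ, a small script along these lines can help. It is only a sketch: the URL is invented for the example, and it assumes the URI module is installed (the spider already depends on it).

```perl
use strict;
use warnings;
use URI;

# Example URL, made up for illustration only.
my $u = URI->new('http://user@www.mydomain.com:8080/docs/index.html');

# authority is everything between "//" and the path (userinfo, host, port)
print "authority: ", $u->authority, "\n";

# host is just the host name, without userinfo or port
print "host:      ", $u->host, "\n";

# host_port is the host plus the port, joined by a colon
print "host_port: ", $u->host_port, "\n";
```

Since host leaves out the port and any userinfo, it is probably the most convenient value to match a domain pattern against.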

-- 
Bill Moseley
moseley@hank.org

Received on Thu Jan 4 10:31:36 2007