Re: Crawling Sub-domains

From: Bill Moseley <moseley(at)>
Date: Thu Jan 04 2007 - 18:31:35 GMT
On Thu, Jan 04, 2007 at 10:07:22AM -0800, James wrote:
> I am also wondering if there is a way to get Swish-e's spider to
> automatically follow links to subdomains of the same domain, without having
> it follow off-site links to other domains.  Do you know what I mean?

The spider is just Perl, so it's easy to change. Look for this test in the spider source:

    # Here we make sure we are looking at a link pointing to the correct (or equivalent) host

    unless ( $server->{scheme} eq $u->scheme && $server->{same_host_lookup}{$u->canonical->authority||''} ) {

How about something like:

    unless ( $server->{scheme} eq $u->scheme && $u->host =~ /(?:^|\.)mydomain\.com$/i ) {

You might want to print out what $u->host returns.
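Here is a rough standalone sketch for checking what $u->host gives you and which links a domain test would let through. It is not part of the spider: the is_same_domain() helper, the $base_domain value, and the sample URLs are all made up for illustration. One thing to watch is that a bare suffix test would also accept a host such as notmydomain.com, so the sketch anchors the match at the start of the host or at a dot.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base domain; replace with your own.
my $base_domain = 'mydomain.com';

# Hypothetical helper: true if $host is the base domain itself
# or any subdomain of it (case-insensitive).
sub is_same_domain {
    my ( $host, $domain ) = @_;
    return 0 unless defined $host && length $host;

    # (?:^|\.) requires the domain to start at the beginning of the
    # host or right after a dot, so "notmydomain.com" is rejected.
    return $host =~ /(?:^|\.)\Q$domain\E\z/i ? 1 : 0;
}

# Sample links (made up) to show what $u->host returns for each.
for my $link (
    'http://www.mydomain.com/docs/',
    'http://search.mydomain.com/',
    'http://mydomain.com/',
    'http://notmydomain.com/',
    'http://www.example.org/',
) {
    my $u    = URI->new($link);
    my $host = $u->host;
    printf "%-32s host=%-20s %s\n",
        $link, $host,
        is_same_domain( $host, $base_domain ) ? 'follow' : 'skip';
}
```

If the output looks right for your own links, the same regular expression can go into the "unless" test in the spider in place of the same_host_lookup check.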

Run "perldoc URI" and take a look at things like:


Bill Moseley

