
Double Slashes When Spidering

From: Michael Tsai <lists(at)not-real.mjtsai.com>
Date: Wed Jan 22 2003 - 19:32:15 GMT
Hi,

I'm trying to use SWISH-E and the included spider.pl script to index a
Web site. The command I'm using to generate the index is:

    ./swish-e -S prog -c swish.conf -v 2

The problem is that the spider goes into an infinite loop. After going
through all the pages on the site, it starts printing out entries like:

    Processing http://www.atpm.com//2.07/index.shtml...
    Processing http://www.atpm.com//2.06/index.shtml...

where it adds a second forward slash after the domain name. If I leave
it running long enough, it makes another pass over the pages with three
slashes.
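
My guess is that the spider treats each of these as a brand-new page
because the URL strings aren't equal, even though they point at the
same file. A quick check with Perl's URI module (which seems to be
what spider.pl hands to test_url, given the $uri->path call below)
shows the two forms don't compare as equal:

    use URI;

    # Two URLs for the same page, differing only by an extra slash
    my $one = URI->new('http://www.atpm.com/2.07/index.shtml')->canonical;
    my $two = URI->new('http://www.atpm.com//2.07/index.shtml')->canonical;

    print $one->eq($two) ? "same\n" : "different\n";   # prints "different"

so presumably the spider's list of already-visited URLs never matches
and the crawl never converges.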

I've seen some postings where people got around this problem when
indexing files in the local filesystem, but I didn't see any references
for what to do when using spider.pl.

Here is my swish.conf file:

    # Program to read documents
    IndexDir ./spider.pl
    
    # Define the config file for the spider to use
    SwishProgParameters spider.conf     
             
    # Use libxml2 for parsing documents
    DefaultContents HTML2
    IndexContents TXT2 txt
    
    # Cache document contents in the index for context display
    StoreDescription HTML2 <body>

and here is the top part of spider.conf:

    @servers = (
        { 
            base_url        => ' http://www.atpm.com',
            same_hosts      => [ qw!atpm.com! ],
            email           => 'swish-e@atpm.com',
            delay_min       => .0001, 
        
            # Define call-back functions to fine-tune the spider 
        
            test_url        => sub {
                my $uri = shift; 
                
                # Skip requesting files that are probably not text
                return if $uri->path =~ m[\.(?:gif|jpg|png)$]i; 
                return if $uri->path =~ m[\.com//];

I've tried lots of variations with and without trailing slashes in
base_url and same_hosts, but none of them makes any difference. I also
thought the last line (the regex with \.com//) might help, but it
doesn't.
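
Writing this up, I realized that $uri->path only returns the path
portion of the URL (no scheme or host), so a pattern containing
\.com// can never match it. Here is the variation of the callback I'm
planning to try next -- just a sketch, assuming test_url is supposed
to return true for URLs the spider should fetch and false for ones it
should skip:

    test_url => sub {
        my $uri = shift;

        # Skip requesting files that are probably not text
        return 0 if $uri->path =~ m[\.(?:gif|jpg|png)$]i;

        # The path never contains the host, so test for a doubled
        # slash in the path itself instead of matching ".com//"
        return 0 if $uri->path =~ m[//];

        return 1;
    },

That should at least keep the spider from re-crawling the
doubled-slash URLs, but I still don't understand where the extra
slashes are coming from in the first place.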

SWISH-E is working great on subsets of the site, but because of this
problem I can't get it to index the full site. There's probably some
simple solution that I'm overlooking. Any idea what it might be?

Thanks,

--Michael
Received on Wed Jan 22 19:32:35 2003