Re: Swish-e indexing the same file multiple times

From: <moseley(at)not-real.hank.org>
Date: Fri May 16 2003 - 11:35:04 GMT

On Fri, May 16, 2003 at 02:26:06AM -0700, A.Little wrote:

> I'm in the process of setting up swish-e to index various websites and I've
> come across a little problem in that swish-e will index
> http://www.mydomain.com/ and http://www.mydomain.com/index.html as 2
> separate pages, even when index.html is the default page for
> www.mydomain.com.

There are a few ways to do that if you are using the -S prog "spider.pl" 
program.

One is to enable MD5 checking -- that will avoid indexing duplicate content.
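
As a rough sketch, MD5 checking is switched on with the "use_md5" setting 
in the spider config file (it needs the Digest::MD5 module).  The base_url 
and email values below are placeholders, not anything from your setup:

```perl
# SwishSpiderConfig.pl -- minimal sketch; base_url and email are placeholders
@servers = (
    {
        base_url => 'http://www.mydomain.com/',
        email    => 'you@mydomain.com',

        # Fingerprint each document's content and skip any content
        # that has already been seen under a different URL.
        use_md5  => 1,
    },
);

1;
```

Note that the spider still fetches both URLs with this approach; it just 
doesn't pass the second copy on to swish-e for indexing.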

The other way would be to use a "test_url" function to add "index.html" to 
all URLs ending in "/".

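    test_url => sub { 
        my $uri = shift;
        my $path = $uri->path;
        # Rewrite ".../" to ".../index.html" so both forms map to one URL.
        $uri->path( $path . "index.html" )
            if $path =~ m[/$];
        return 1;  # must return true, or spider.pl skips the URL
    }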
    
I didn't test that just now, so you might need to tweak it.  Don't use it 
if the server's default index file might be named something other than 
index.html.
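
If you want to try the URL rewriting outside the spider first, something 
like the script below should do.  It only assumes the URI module (which 
spider.pl needs anyway); the add_index_html name and the sample URLs are 
made up for illustration:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper with the same logic as the test_url callback:
# append "index.html" to any path ending in "/".
sub add_index_html {
    my ($uri) = @_;
    my $path = $uri->path;
    $uri->path( $path . 'index.html' ) if $path =~ m{/$};
    return 1;    # always true, so the spider would keep the URL
}

# Sample URLs, for illustration only.
for my $url (
    'http://www.mydomain.com/',
    'http://www.mydomain.com/index.html',
    'http://www.mydomain.com/docs/',
) {
    my $uri = URI->new($url);
    add_index_html($uri);
    print "$url => $uri\n";
}
```

The first and second URLs should both come out as 
http://www.mydomain.com/index.html, which is what lets the spider treat 
them as one page.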

-- 
Bill Moseley
moseley@hank.org
Received on Fri May 16 11:35:11 2003