Re: Duplicate files

From: Bill Moseley <moseley(at)>
Date: Fri Apr 26 2002 - 16:27:35 GMT
At 09:04 AM 04/26/02 -0700, GUEGAN Ronald wrote:
>Is there a way to detect that an HTML file has already been indexed?
>We are indexing websites where a file can be accessed in various ways:
>  -
>  -
>In the given example, both URLs could point to the same page.

If you are using the 2.1-dev version (soon to be a pre-release) with -S prog
and the spider program, then yes, you can.  That spider has an MD5 option to
fingerprint each page, so that should catch duplicates.
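
For illustration only, here is a minimal sketch of the general idea behind
MD5 fingerprinting: hash each page's content and skip any page whose digest
has already been seen.  This is not the spider's actual code; the names
fingerprint and DuplicateFilter and the example.com URLs are hypothetical,
and the sketch assumes duplicates are byte-for-byte identical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the MD5 hex digest of a page's raw bytes."""
    return hashlib.md5(content).hexdigest()


class DuplicateFilter:
    """Remembers which content digests have been seen (hypothetical helper)."""

    def __init__(self):
        # Maps digest -> the first URL that delivered that content.
        self._seen = {}

    def check(self, url: str, content: bytes):
        """Return the URL first seen with identical content, or None if new."""
        digest = fingerprint(content)
        first_url = self._seen.get(digest)
        if first_url is None:
            self._seen[digest] = url
        return first_url


if __name__ == "__main__":
    # Made-up pages: the first two URLs serve the same bytes.
    pages = [
        ("http://example.com/", b"<html>home</html>"),
        ("http://example.com/index.html", b"<html>home</html>"),
        ("http://example.com/about.html", b"<html>about</html>"),
    ]
    dupes = DuplicateFilter()
    for url, body in pages:
        original = dupes.check(url, body)
        if original is None:
            print("index:", url)
        else:
            print("skip: ", url, "(same content as", original + ")")
```

A content hash only catches exact duplicates; two URLs whose pages differ
by even one byte (a timestamp, a session id) get different digests and
would both be indexed.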

We discussed this just a few days ago, so you might check the list
archives, too.

Bill Moseley
Received on Fri Apr 26 16:27:36 2002