Re: Duplicate Entries - BUG?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Nov 03 2001 - 01:05:13 GMT
At 04:42 PM 11/02/01 -0800, Bruce Pettyjohn wrote:
>I have noticed that there are duplicate entries for the URLs which are 
>replicated on
>many pages.  There does not seem to be any way to ensure that this does not 
>happen.
>Is it a bug or is there a configuration error on my part?

What do you mean by duplicate?  

Do you mean the spider is indexing pages more than once?


>#		     Duplicates:      80,345  (6.2/sec)

That "Duplicates:" line is a count of the links the spider found that
pointed to a URL it had already spidered, i.e. links it skipped because it
had already seen that URL.
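
To make the meaning of that counter concrete, here is a minimal sketch in Python of how a spider could tally it. This is an illustration only, not the spider's actual code: the function name `count_links` and the sample URLs are made up, and it assumes URLs are compared as plain strings with no normalization.

```python
def count_links(links):
    """Return (unique, duplicates) for a stream of extracted link URLs.

    Hypothetical helper: a link counts as a duplicate when its URL has
    already been seen, so it is skipped rather than fetched again.
    """
    seen = set()
    unique = 0
    duplicates = 0
    for url in links:
        if url in seen:
            duplicates += 1      # already seen: skip, do not index again
        else:
            seen.add(url)
            unique += 1          # first sighting: this one gets fetched
    return unique, duplicates


if __name__ == "__main__":
    # A navigation link repeated on every page is counted once as unique
    # and once as a duplicate for each later sighting.
    sample = [
        "http://example.com/",
        "http://example.com/about.html",
        "http://example.com/",
        "http://example.com/",
    ]
    print(count_links(sample))
```

Under this reading, a site where the same URLs appear on many pages will show a large "Duplicates:" number even though each page is indexed only once.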


>#		31856 unique words indexed.
>#		5 properties sorted.
>#		13339 files indexed.  198504356 total bytes.
>#		Elapsed time: 03:36:03 CPU time: 00:16:20

Are you indexing a remote web server?  Or do you have a delay set?  I'm
wondering why it's taking 3 1/2 hours to index.  
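
A rough calculation from the numbers in your summary shows why the run time stands out. The sketch below only does arithmetic on the quoted figures; the helper name `to_seconds` is made up for the example.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the quoted indexing summary.
elapsed = to_seconds("03:36:03")   # wall-clock time
cpu = to_seconds("00:16:20")       # CPU time
files = 13339

files_per_second = files / elapsed     # roughly one file per second
cpu_fraction = cpu / elapsed           # well under a tenth of the run

if __name__ == "__main__":
    print(f"{files_per_second:.2f} files/sec, CPU busy {cpu_fraction:.1%}")
```

That works out to about one file per second, with the CPU busy for less than a tenth of the elapsed time, so most of the run was presumably spent waiting (network, server, or a configured delay) rather than indexing.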

If you have a current LWP setup you can run with "keep alive", which will
help both you (faster indexing) and the server (fewer requests).
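
For anyone unfamiliar with the idea, here is a small sketch of what "keep alive" buys you: several requests travel over one TCP connection instead of a new connection being opened for every page. It uses Python's standard library as a stand-in rather than LWP, and the names `EchoHandler` and `fetch_all` are made up for the example. The tiny local server is there only so the sketch can run without touching a real site.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Toy server for the demo: echoes the request path back."""
    protocol_version = "HTTP/1.1"   # HTTP/1.1 allows persistent connections
    client_ports = set()            # one entry per distinct client connection

    def do_GET(self):
        EchoHandler.client_ports.add(self.client_address[1])
        body = self.path.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass                        # keep the demo quiet


def fetch_all(host, port, paths):
    """Fetch every path over a single, reused connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read fully before the connection is reused.
            results.append((response.status, response.read()))
    finally:
        conn.close()
    return results


if __name__ == "__main__":
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        print(fetch_all("127.0.0.1", server.server_address[1], ["/a", "/b"]))
        print("connections used:", len(EchoHandler.client_ports))
    finally:
        server.shutdown()
        server.server_close()
```

The per-request connection setup is skipped after the first page, which is where the savings for both the client and the server come from.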

BTW -- there's also a way to index links, so you can ask "which pages link
to this URL?".

Sorry, I guess I'm not clear on the problem.



Bill Moseley
mailto:moseley@hank.org
Received on Sat Nov 3 01:05:34 2001