Duplicate Entries - BUG?

From: Bruce Pettyjohn <bruce.pettyjohn(at)not-real.varianinc.com>
Date: Sat Nov 03 2001 - 00:43:04 GMT
Hello,

We have been experimenting with the swish-e-2_1-dev-24-2001-10-18 release
and are impressed with the performance, flexibility, and accuracy.

I have noticed that there are duplicate entries for URLs that are
replicated on many pages.  There does not seem to be any way to prevent
this.  Is it a bug, or is there a configuration error on my part?
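
One way to narrow this down might be to check whether the duplicate
entries are distinct URLs that return identical content. The following is
only a minimal sketch in Python, not part of swish-e or spider.pl: the
function name find_duplicate_content and the sample URLs are hypothetical,
and it assumes the fetched pages are already available as a mapping from
URL to raw bytes.

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages):
    """Group URLs by a digest of their content.

    `pages` maps URL -> raw page bytes (an assumed input format).
    Returns a list of URL groups, one per digest shared by 2+ URLs.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body).hexdigest()
        by_digest[digest].append(url)
    # Keep only digests seen under more than one URL.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Hypothetical sample data, for illustration only.
pages = {
    "http://example.com/a/index.html": b"<html>same page</html>",
    "http://example.com/b/index.html": b"<html>same page</html>",
    "http://example.com/c.html": b"<html>other page</html>",
}

for group in find_duplicate_content(pages):
    print(group)
```

If groups like this show up, the duplicates are the same document reached
under different URLs rather than one URL indexed twice.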

Our configuration is:
-----------------------------
Operating system: 	Solaris 8
Operating mode:	swish-e -c ./swish-e.conf -S prog

Web crawl resulted in:
#	MaxDepth 10:	with prog option and spider.conf	
#		/usr/varian/search/spider.pl: Summary for:
#		     Duplicates:      80,345  (6.2/sec)
#		Off-site links:      11,158  (0.9/sec)
#		PDF transformed:         712  (0.1/sec)
#		        Skipped:         206  (0.0/sec)
#		    Total Bytes: 198,504,356  (15325.0/sec)
#		     Total Docs:      13,339  (1.0/sec)
#		    Unique URLs:      13,630  (1.1/sec)
#
#		Removing very common words...
#		no words removed.
#		Writing main index...
#		Sorting words ...
#		Sorting 31856 words alphabetically
#		Writing header ...
#		Writing index entries ...
#		  Writing word text: Complete
#		  Writing word hash: Complete
#		  Writing word data: Complete
#		31856 unique words indexed.
#		5 properties sorted.
#		13339 files indexed.  198504356 total bytes.
#		Elapsed time: 03:36:03 CPU time: 00:16:20
#		Indexing done!

Any suggestions?

Thank you,
Bruce
Received on Sat Nov 3 00:43:38 2001