MD5 not filtering out 'variant' querystrings

From: Tim Hartley <tim.hartley(at)not-real.planetpdf.com>
Date: Thu Dec 02 2004 - 03:27:09 GMT

Hi All,

Firstly, a big thank you to Bill Moseley and Peter Karman for your prompt and helpful responses to my many questions posted here.

Secondly, I've got another question :)

One of the searches that I've set up on my site is returning multiple hits for the same page. The hits differ only by slight variations in the query string, e.g. http://blah.com/details.asp?prodid=615, http://blah.com/details.asp?prodid=615&fa, etc. I'm using MD5 in my spider configuration, but it doesn't seem to clear up these duplicates, and in some cases I get up to six different query strings for the same page. Any clues? Details follow.

Regards,

Tim

+++Conf file+++
IndexDir spider.pl
IndexReport 3
IgnoreMetaTags script
obeyRobotsNoIndex yes
StoreDescription HTML2 <description> 500
IndexFile c:\swish-e\storeSearchIndex.index
SwishProgParameters PS_Search_Spider.config

+++Config File+++
@servers = (
    {
        debug      => DEBUG_SKIPPED | DEBUG_ERRORS | DEBUG_URL,
        base_url   => 'http://www.pdfstore.com/',
        email      => 'binary@binarything.com',
        keep_alive => 1,
        agent      => 'pp_ps',
        test_url   => sub {
            use URI::QueryParam;
            my ( $uri, $server ) = @_;
            my $id = $uri->query;
            $id = lc( $id );
            $uri->query( $id );
            return 1;
        },
        use_md5 => 1,
    },
);
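
For comparison, here is a minimal sketch of a test_url callback that rewrites each query string to a single canonical form instead of only lower-casing it. It is one possible approach, not a confirmed fix. The helper name canonical_query and the whitelist containing only 'prodid' are assumptions made for illustration (the parameter name is taken from the example URLs above), and the sketch assumes the spider checks for already-seen URLs after test_url has returned.

```perl
use URI;

# Hypothetical whitelist: parameters assumed to identify a page.
# 'prodid' comes from the example URLs; adjust for the real site.
my %keep = map { $_ => 1 } qw( prodid );

# Hypothetical helper: rewrite the query string of a URI object in
# place so that variants of the same page collapse to one form.
sub canonical_query {
    my ($uri) = @_;
    my @pairs = $uri->query_form;    # flat list: key, value, key, value, ...
    my %kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        $key = lc $key;
        $kept{$key} = $value if $keep{$key};
    }
    if (%kept) {
        # Sorted keys give a stable parameter order.
        $uri->query_form( map { $_ => $kept{$_} } sort keys %kept );
    }
    else {
        $uri->query(undef);          # nothing identifying left: drop the query
    }
    return $uri;
}

@servers = (
    {
        base_url => 'http://www.pdfstore.com/',
        email    => 'binary@binarything.com',
        use_md5  => 1,
        test_url => sub {
            my ( $uri, $server ) = @_;
            canonical_query($uri);
            return 1;                # still spider the (rewritten) URL
        },
    },
);
```

With this whitelist, a URL such as details.asp?prodid=615&fa would be rewritten to details.asp?prodid=615 before the spider records it. Any parameter that genuinely changes the page content would need to be added to the whitelist.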

+++Search Summary:+++
Connection: Keep-Alive: 115 (1.3/sec)
Duplicates: 3,627 (40.8/sec)
Off-site links: 1,231 (13.8/sec)
Total Bytes: 2,855,450 (32083.7/sec)
Total Docs: 81 (0.9/sec)
Unique URLs: 116 (1.3/sec)
robots.txt: 33 (0.4/sec)
(1411 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words...
Sorting 3,055 words alphabetically
Writing header...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
3,055 unique words indexed.
Received on Wed Dec 1 19:27:10 2004