Unable to spider certain pages, md5 problem

From: Tim Hartley <tim.hartley(at)not-real.planetpdf.com>
Date: Wed Sep 08 2004 - 06:49:41 GMT
Hiya,

I'm trying to get Swish-E up and running on our test website, with the aim of using it instead of our current Google-based search.

Currently I'm having two problems. Firstly, it doesn't seem to be spidering some vital ASP pages, and secondly, I'm still getting some duplicated pages despite having MD5 enabled. However, I am a Perl gumbie, so I'm hoping that I'm missing something pretty basic.

 ++ Spidering Problem: ++

Spider.pl is not accessing certain ASP pages, in particular the article.asp page, in which all our articles are displayed (e.g. /article.asp?contentid=1545, /article.asp?contentid=1546, etc.). This means that searching the created index does not return any articles directly.

Search results do reference them indirectly, by returning a non-article.asp page that links to them. For example, a search may return an author bio page, which lists the articles that author has contributed to the site.

 ++ MD5 Problem: ++

Due to the layout of the website, some articles and areas appear in more than one part of the site, and both copies show up in the search results. Some examples are:
 - "/author.asp?author=John_Doe" is listed right next to "/author.asp?author=john_doe"
 - Introduction to Development (/developer/learningcenter.asp?ID=100) is listed alongside Introduction to Development (/creative/learningcenter.asp?ID=100).
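
To illustrate what content-based de-duplication can and cannot catch, here is a minimal Python sketch of the general technique (hashing each page body with MD5 and skipping bodies already seen). It is not spider.pl's actual code, and the page bodies below are made up for illustration; only the URLs come from the examples above.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return the MD5 hex digest of a page body."""
    return hashlib.md5(body).hexdigest()


def dedupe(pages):
    """Keep the first URL seen for each distinct page body.

    `pages` is a list of (url, body) pairs; the URL itself plays no
    part in the comparison, only the bytes of the body do.
    """
    seen = set()
    kept = []
    for url, body in pages:
        digest = content_digest(body)
        if digest in seen:
            continue  # byte-identical content was already kept
        seen.add(digest)
        kept.append(url)
    return kept


# Hypothetical bodies: the two author pages are byte-identical, while the
# two learning-center pages differ only in a section-specific fragment.
pages = [
    ("/author.asp?author=John_Doe", b"<html>John Doe bio</html>"),
    ("/author.asp?author=john_doe", b"<html>John Doe bio</html>"),
    ("/developer/learningcenter.asp?ID=100",
     b"<html><nav>developer</nav>Introduction to Development</html>"),
    ("/creative/learningcenter.asp?ID=100",
     b"<html><nav>creative</nav>Introduction to Development</html>"),
]

print(dedupe(pages))
```

In this sketch the second author URL is dropped because its body is byte-identical to the first, but both learning-center URLs are kept because a single differing fragment changes the digest. If the real pages embed anything that varies per URL (a section menu, a self-referencing link, a timestamp), a digest check would treat them as different documents; whether that is what is happening here is only a guess.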

 ++ More info: ++

 - All pages get their content from a database via ASP commands, and article.asp uses the same style of commands as other pages that are spidered, so in theory its data should be spidered just as easily as that of the other pages.
 - As the site's using ASP pages, I'm using the ASPEXEC dll to call the swish exe, since I have no clue when it comes to Perl.
 - The parameters to call Swish-e from within the ASP page are:
	strParam = "-w """&strSearchWords&""" -b " & intStart & " -m " & intPage & " -x ""\<BR\><swishrank>|<swishtitle>|<swishdocpath>|<swishlastmodified>|<description>"" -f ""c:\swish-e\testindex.index"""
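
Because the doubled quotes in the ASP string are hard to read, here is the same set of switches written out as a plain argument list in Python. This is only a sketch to show what ends up on the command line: the function name and the sample values are hypothetical, while the switches and the output format string are copied from the ASP line above.

```python
def build_swish_args(words, start, page_size, index_path):
    """Assemble the swish-e search switches used in the ASP call.

    Returns a list of arguments, one element per switch or value, so
    no manual quote-doubling is needed.
    """
    # Output format copied verbatim from the ASP parameter string.
    output_format = (
        r"\<BR\><swishrank>|<swishtitle>|<swishdocpath>"
        r"|<swishlastmodified>|<description>"
    )
    return [
        "-w", words,            # search words
        "-b", str(start),       # first result to return
        "-m", str(page_size),   # maximum number of results
        "-x", output_format,    # per-result output format
        "-f", index_path,       # index file to search
    ]


# Hypothetical sample values for illustration only.
args = build_swish_args("acrobat forms", 1, 10, r"c:\swish-e\testindex.index")
print(args)
```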

Config file: 
---------- PPswish.conf ----------
IndexFile c:\swish-e\testindex.index
IndexDir spider.pl

#ignore Javascript in the description
IgnoreMetaTags script

StoreDescription HTML2 <swishdescription> 200000

#Just search in the title, description meta fields for the search terms
MetaNames description 
MetaNameAlias description title article_display
PropertyNames description
PropertyNameAlias description title article_display

#Show a detailed report whilst indexing 
IndexReport 3

#Obey the Robots.txt index file whilst indexing
obeyRobotsNoIndex yes

SwishProgParameters spider.config
-------------------------------------

 - spider.config file

------ spider.config ---------
 @servers = (
     {
         base_url => 'http://cm3.planetpdf.com/creative/article.asp',
         email    => 'binary@binarything.com',
         use_md5  => 1,
     },
 );
---------------------------------

Any help would be MUCH appreciated!

-Tim
_____________________
Tim Hartley
www.planetpdf.com
Received on Tue Sep 7 23:50:00 2004