On Wed, Mar 10, 2004 at 05:03:58PM -0800, Justin Tang wrote:
> Hi:
> Is there any way, when using spider.pl, to avoid spidering duplicate pages
> with different session vars?
>
> For ex:
> www.mysite.com/cat/page.html?sesID=2JLHGJ54KH2G3J4HG
> www.mysite.com/cat/page.html?sesID=23KJ54HIYGOIYSDFH
>
> These are two copies of the same page, but it is getting spidered twice. Thanks.
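
This is all untested, but...

One way would be to keep a hash of URLs already seen in a test_url
callback and reject duplicates. Since the two URLs differ only in
sesID, the hash key would have to be the URL with sesID removed.

Or, in the test_url() function you could simply remove the sesID from
the URL and let the spider do the rest. The test to see whether a page
has already been seen comes after test_url() is called, IIRC.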
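
Both ideas can be combined in one callback. What follows is an untested
sketch, not something taken from the spider.pl distribution. It assumes
the callback is handed a URI object as its first argument and that
changes made to that object are what the spider goes on to use. The
%seen hash and the sub name are made up for illustration.

```perl
use strict;
use warnings;

my %seen;    # session-free URLs already accepted during this run

# Intended as a test_url callback: return true to spider the URL,
# false to skip it.
sub test_url {
    my ( $uri, $server ) = @_;

    # Collect every query parameter except sesID, preserving order.
    my @pairs = $uri->query_form;
    my ( @kept, $removed );
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        if ( $key eq 'sesID' ) { $removed = 1; next }
        push @kept, $key, $value;
    }

    # Only rewrite the query string when a sesID was actually found.
    if ($removed) {
        if (@kept) { $uri->query_form( \@kept ) }
        else       { $uri->query(undef) }
    }

    # Accept the first occurrence of each session-free URL only.
    return !$seen{ $uri->canonical->as_string }++;
}

# In the spider config, point the server's entry at it:
#     test_url => \&test_url,
```

If the spider's own duplicate check really does run after test_url(),
the %seen hash is redundant and stripping the sesID is enough on its
own.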
--
Bill Moseley
moseley@hank.org
Received on Wed Mar 10 22:22:22 2004