Re: Finding Session Var While Spidering?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Mar 11 2004 - 06:22:14 GMT

On Wed, Mar 10, 2004 at 05:03:58PM -0800, Justin Tang wrote:
> Hi:
>   Is there any way, when using spider.pl, to avoid spidering duplicate pages
> with different session vars?
> 
> For ex:
>   www.mysite.com/cat/page.html?sesID=2JLHGJ54KH2G3J4HG
>   www.mysite.com/cat/page.html?sesID=23KJ54HIYGOIYSDFH
> 
> These are the same page, but it is getting spidered twice.  Thanks.

This is all untested, but...

One way would be to keep a hash in a test_url callback, keyed on the URL
with the sesID removed, and reject duplicates.  Or, in the test_url()
function, you could remove the sesID from the URL itself.  The check for
whether a page has already been seen comes after test_url() is called,
IIRC, so the stripped URL should be the one that gets compared.

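
To make the first suggestion concrete, here is a rough, untested sketch of
a server entry for a spider.pl config file.  It assumes test_url is called
with a URI object as its first argument and that a false return value
skips the URL.  The helper name without_session(), the %seen hash, and the
base_url and email values are made up for illustration; only the sesID
parameter name comes from the example URLs above.

```perl
use strict;
use warnings;

# Normalized URLs (session ID removed) that have already been accepted.
my %seen;

# Return a copy of the URI with the sesID query parameter removed.
# Hypothetical helper, not part of spider.pl.
sub without_session {
    my ($uri) = @_;
    my $clean = $uri->clone;

    # query_form returns a flat key, value, key, value list.
    my @pairs = $clean->query_form;
    my @kept;
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        push @kept, $key, $value unless $key eq 'sesID';
    }

    # Drop the query entirely if sesID was the only parameter.
    if (@kept) {
        $clean->query_form( \@kept );
    }
    else {
        $clean->query(undef);
    }
    return $clean;
}

# Add \%server to @servers the same way your existing config does.
my %server = (
    base_url => 'http://www.mysite.com/',    # placeholder
    email    => 'you@example.com',           # placeholder
    test_url => sub {
        my ($uri) = @_;
        my $key = without_session($uri)->canonical->as_string;

        # True the first time a normalized URL is seen, false afterwards.
        return !$seen{$key}++;
    },
);

1;
```

For the second suggestion, the same loop could modify $uri in place
instead of working on a clone, and the callback would then just return
true.  That relies on spider.pl doing its own duplicate check on the
modified URL, which is the part I'm not certain about.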

-- 
Bill Moseley
moseley@hank.org
Received on Wed Mar 10 22:22:22 2004