Re: Indexing (test_url)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 09 2005 - 18:18:34 GMT

On Tue, Aug 09, 2005 at 01:53:57PM -0400, Richard Vaillancourt wrote:
> On 9293 pages fetched, only 1272 are unique and that is because we're
> still fetching pages that have that ";jsessionid=" ending more than
> once.

So, what's happening is you initially go to a link without a
jsessionid on your URL.  The site sees that your request is missing
the jsessionid so a new session ID is created and added to every link
on the page.

So then the spider follows all those links using the jsessionid.  But,
at some point some page is displaying a link back to your site
*without* the jsessionid.  The spider follows that link and then your
system generates a NEW jsessionid and the process starts all over.
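
If you want to confirm that's where the duplicates come from, here's a
rough sketch in plain Perl (nothing swish-e specific).  It strips the
jsessionid from a list of URLs and counts how many distinct pages are
left.  The sample URLs and the strip_jsessionid() name are made up for
illustration; feed it whatever your spider actually fetched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop a ";jsessionid=..." path parameter,
# leaving any query string or fragment in place.
sub strip_jsessionid {
    my ($url) = @_;
    $url =~ s{;jsessionid=[^;/?#]*}{}i;
    return $url;
}

# Made-up sample of fetched URLs: two session IDs, two real pages.
my @fetched = (
    'http://www.example.com/a.jsp;jsessionid=AAA111',
    'http://www.example.com/b.jsp;jsessionid=AAA111?x=1',
    'http://www.example.com/a.jsp;jsessionid=BBB222',
    'http://www.example.com/b.jsp;jsessionid=BBB222?x=1',
);

# Count distinct URLs once the session ID is ignored.
my %unique;
$unique{ strip_jsessionid($_) }++ for @fetched;

printf "%d fetched, %d unique\n", scalar @fetched, scalar keys %unique;
```

If the unique count drops sharply once the session ID is removed, the
extra fetches are the same pages under different session IDs.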

> We also tried using a filter_content() callback function, without
> success.

I tend to like more details when someone says something like that. ;)

> While it seems that we can filter before indexing, it would be more
> handy to filter before fetching. Ideas anyone?

Ok, before fetching?  Well, how about something along the lines of
this in a test_url:

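    # Always use the same jsessionid, regardless of what the server
    # tells us.  $uri is the URI object passed to test_url; declare
    # "my $first_id;" outside the callback so it persists between calls.
    my $path = $uri->path;
    if ( $path =~ m{;jsessionid=([^;/]*)}i ) {
        if ( defined $first_id ) {
            # swap in the first session ID we saw
            $path =~ s{;jsessionid=[^;/]*}{;jsessionid=$first_id}i;
            $uri->path( $path );
        } else {
            $first_id = $1;    # remember the first one
        }
    }
    return 1;    # true means go ahead and fetch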

Another option is to just keep your own cache of URLs:

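    # Avoid duplicate URLs.  Declare "my %seen_url_before;" outside the
    # callback so it persists between calls.
    ( my $new_url = $uri->as_string ) =~ s{;jsessionid=[^;/?#]*}{}i;
    return !$seen_url_before{ $new_url }++;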

-- 
Bill Moseley
moseley@hank.org

Received on Tue Aug 9 11:18:34 2005