Re: Indexing (test_url)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 09 2005 - 19:51:46 GMT
On Tue, Aug 09, 2005 at 11:17:56AM -0700, Bill Moseley wrote:
> Another option is to just keep your own cache of URLs:
> 
>     # Avoid duplicate URLs
>     my $new_url = url with jsessionid removed;
>     return !$seen_url_before{ $new_url }++;

In private email you sent:

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;

    # always use the same jsessionid, regardless of what the server
    # tells us.
    my $new_url = url with jsessionid removed;
    return !$seen_url_before{ $new_url }++;

    # ignore any common image files
 #   $uri->path =~ s/\;jses.+//;
    return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
    
}

Sorry, that was just a suggestion in pseudocode, not actual Perl code.

Also, that first "return !$seen..." returns unconditionally, so the
rest of your code (the image check) is never reached.
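
To illustrate the ordering, here is a minimal sketch with the filter
first and the "seen" check as the last statement. ok_to_spider is a
hypothetical helper that takes the URL as a plain string (not the URI
object spider.pl passes to test_url), and it assumes the string has no
query part after the file name.

```perl
use strict;
use warnings;

my %seen_url_before;

# Hypothetical helper, not part of spider.pl: takes the URL as a string.
sub ok_to_spider {
    my ($url) = @_;

    # Filter first: skip common image files (assumes no query string).
    return 0 if $url =~ /\.(?:gif|jpe?g|png)$/i;

    # Last statement: true only the first time this URL is seen.
    return !$seen_url_before{$url}++;
}

print ok_to_spider('http://example.com/a.html')   ? "index\n" : "skip\n";  # index
print ok_to_spider('http://example.com/a.html')   ? "index\n" : "skip\n";  # skip
print ok_to_spider('http://example.com/logo.png') ? "index\n" : "skip\n";  # skip
```

Note the image pattern here has no trailing "?" after the group, so it
only matches URLs that really end in one of those extensions.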




my %seen_uri;

[later]

sub test_url {
    my ( $uri_orig ) = @_;

    my $uri = $uri_orig->clone;  # don't want to change original;

    # Warning -- a hash keeps one value per name, so repeated
    # parameters (a=1&a=2) and the parameter order are lost
    my %params = $uri->query_form;  # grab parameters and store as a hash
    delete $params{jsessionid};     # delete jsessionid

    if ( %params ) {
        $uri->query_form( %params  ); # update the parameter list
    } else {
        $uri->query( undef );         # or just erase the params.
    }

    return if $seen_uri{ $uri }++;  # return (false) if seen this URL before.

    [...]
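
If repeated parameters matter, one way around the hash is to filter the
flat key/value list pairwise instead. This is only a sketch on a plain
list so it runs without the URI module; drop_jsessionid is a made-up
name. With a URI object the list would come from $uri->query_form in
list context, and a non-empty result could be handed back to
$uri->query_form( ... ).

```perl
use strict;
use warnings;

# Hypothetical helper: drop every pair whose key is 'jsessionid' from a
# flat key/value list, keeping repeated keys and their order intact.
sub drop_jsessionid {
    my @pairs = @_;
    my @kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        push @kept, $key, $value unless $key eq 'jsessionid';
    }
    return @kept;
}

my @kept = drop_jsessionid( id => 1, jsessionid => 'ABC', id => 2 );
print "@kept\n";   # id 1 id 2
```

An empty result still needs the $uri->query( undef ) branch, since
calling query_form with no arguments only reads the parameters.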



BTW -- you posted this:

    http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce;jsessionid=ECB59031040C6AD9AA3B49FFD2EDFC5E

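The jsessionid there is tacked onto the path with a ';' instead of
being passed as a query parameter after a '?'.  The above WILL NOT
work with those URLs, since query_form only sees what comes after
the '?'.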
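
For URLs like that one, where the session id follows a ';' in the path,
a rough sketch of stripping it from the string form is below.
strip_path_jsessionid is a hypothetical helper, and the pattern assumes
the session id contains none of '?', '#', ';' or '/'.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a ";jsessionid=..." path parameter from a
# URL given as a plain string.
sub strip_path_jsessionid {
    my ($url) = @_;

    # Remove ";jsessionid=" and everything up to (not including) the
    # next '?', '#', ';' or '/', or the end of the string.
    $url =~ s/;jsessionid=[^?#;\/]*//i;
    return $url;
}

print strip_path_jsessionid(
    'http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce;jsessionid=ECB59031040C6AD9AA3B49FFD2EDFC5E'
), "\n";
# http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce
```

With a URI object the same substitution could be applied to a copy of
$uri->path and the result set back with $uri->path( $new_path ); a
substitution directly on $uri->path does not modify the object.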






-- 
Bill Moseley
moseley@hank.org

Received on Tue Aug 9 12:51:48 2005