Re: Unable to spider certain pages, md5 problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 09 2004 - 17:21:35 GMT
On Thu, Sep 09, 2004 at 05:39:41PM +1000, Tim Hartley wrote:
> However, my test_url function doesn't seem to be working, as I am
> still getting duplicate results caused by Uppercase/Lowercase urls
> to the same content -> /author.asp?author=Joe Blow,
> /author.asp?author=joe blow.
> Debug doesn't throw any errors or warnings regarding the test_url,
> and I've used the code given in the documentation example, but it's
> either not converting the url's to lowercase or if it is it's not
> comparing them successfully. The results are still displaying the
> output with uppercase characters, so I'm assuming it's not
> converting to lowercase.

>          test_url =>sub {
>                       my $uri = shift;
>                       $uri->path(lc$uri->path);
>                       return 1;
>           },

A print statement is a good debugging tool.

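perldoc URI explains how to use the URI module.  The short answer is
that the "path" and the "query" are two different parts of the URI.
Your code lowercases only the path (/author.asp); the author name is
in the query string, which $uri->path never touches.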
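
To see the difference for yourself, here is a minimal sketch you can run outside the spider. The example.com host and the mixed-case URL are made up for illustration; only the URI module's path and query methods are used.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL, similar in shape to the ones in the question.
my $uri = URI->new('http://example.com/Author.asp?author=Joe+Blow');

# Print each part separately to see what lc() would actually be applied to.
print "path:  ", $uri->path,  "\n";
print "query: ", $uri->query, "\n";

# This is what the original test_url did: only the path changes.
$uri->path( lc $uri->path );
print "after: $uri\n";
```

The "after" line should still show the author name in its original case, because the query string was never modified.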

Now, I know there's a better way to manage query parameters -- I just
can't remember right now, so until then you might try something like
this:

    test_url => sub {
        my ( $uri ) = @_;
        my %params = $uri->query_form;
        $_ = lc for values %params;
        $uri->query_form( %params );
        return 1;
    },

The important thing to keep in mind here is that this will break if
you have two parameters with the same name (like a multi-valued
parameter), because a hash keeps only one value per name.  If that's
the case then you likely need to stick to arrays and not use a hash.
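
A small sketch of that failure, assuming a made-up URL with a repeated "tag" parameter (the host and parameter name are only for illustration):

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL with the same parameter name used twice.
my $uri = URI->new('http://example.com/search.asp?tag=Perl&tag=Swish');

# Copying the key/value list into a hash: the second "tag" replaces the first.
my %params = $uri->query_form;
$_ = lc for values %params;
$uri->query_form( %params );

# Only one "tag" parameter is left in the rewritten URI.
print "$uri\n";
```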

Let's see, how about this:

    test_url => sub {
        my $uri = shift;
        my @params = $uri->query_form;
        return 1 unless @params;
        my $x = 0;
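        # @params is a flat (name, value, name, value, ...) list;
        # lowercase only the values, which sit at the odd indexes.
        $x++ % 2 && ( $_ = lc ) for @params;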
        $uri->query_form( @params );
        return 1;
    },
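
If you want to check that callback without running the spider, something like this should do. It is only a sketch: the $test_url variable and the example.com URL are made up, and the sub body is the same as above.

```perl
use strict;
use warnings;
use URI;

# The array-based callback, held in a variable so it can be called directly.
my $test_url = sub {
    my $uri = shift;
    my @params = $uri->query_form;
    return 1 unless @params;
    my $x = 0;
    $x++ % 2 && ( $_ = lc ) for @params;   # values only, names untouched
    $uri->query_form( @params );
    return 1;
};

# Hypothetical URL with a repeated parameter and mixed-case values.
my $uri = URI->new('http://example.com/author.asp?author=Joe+Blow&author=JANE+DOE');
$test_url->( $uri );
print "$uri\n";
```

Both "author" values should survive, in order and in lowercase.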



-- 
Bill Moseley
moseley@hank.org
Received on Thu Sep 9 10:22:18 2004