Re: duplicate documents

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 01 2004 - 16:27:58 GMT
On Fri, Oct 01, 2004 at 08:50:33AM -0700, Jon Sorensen wrote:
> I'm trying to spider a number of sites but spider.pl keeps getting in a loop
> at:
> 
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2310&RegID= -
> Using HTML2 parser -  (582 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2311&RegID= -
> Using HTML2 parser -  (591 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2312&RegID= -
> Using HTML2 parser -  (582 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2313&RegID= -
> Using HTML2 parser -  (582 words)

Looks like only the ClassID changes, right?  If so, you could strip
that parameter out of the URL in a test_url function.  man URI
describes how to deal with the URI object that is passed in.  I think
there are some examples in the list archives.
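
Something along these lines might do it. This is only a sketch based on
the config quoted below: it assumes spider.pl goes on to use the URI
object as modified inside test_url, and it uses query_param_delete()
from URI::QueryParam, which is not part of the original config.

```perl
use URI::QueryParam;  # adds query_param_delete() to URI objects

my %serverD = (
    base_url   => 'https://secure.meriter.com/classreg/',
    email      => 'jon@starkmedia.com',
    keep_alive => 1,
    test_url   => sub {
        my $uri = shift;

        # Skip the page that caused the earlier loop.
        return 0 if $uri->path =~ /ReviewClasses\.cfm/;

        # Assumption: ClassID is the only thing that differs between
        # the looping desc.cfm URLs, so drop it from the query string.
        $uri->query_param_delete('ClassID')
            if $uri->path =~ /desc\.cfm$/;

        return 1;
    },
);
@servers = ( \%serverD, );
```

With ClassID removed, every desc.cfm link for a given CatID should look
like the same URL to the spider, so it would only be fetched once.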


> it was getting stuck on ReviewClasses.cfm so I'm using test_url to stop that
> but I want to index the desc.cfm pages, I'm planning on trying use_md5 but
> not sure if that will make any difference

It should make a difference if the content is the same for each
request.  But removing the ClassID should solve the problem, too (if
that's what is making the URL different each time).
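
To illustrate why identical content matters for a digest check, here is
a rough standalone sketch of the idea. It is not spider.pl's actual
code; is_duplicate() and %seen are made-up names for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;  # digest => number of times that content has been seen

# Hypothetical helper: true if this exact content was seen before.
sub is_duplicate {
    my $content = shift;
    return $seen{ md5_hex($content) }++ ? 1 : 0;
}
```

Two pages only count as duplicates if their content is byte-for-byte
identical, so pages that differ even slightly would still be indexed
separately.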


> my %serverD = (
>     base_url    => 'https://secure.meriter.com/classreg/',
>     email       => 'jon@starkmedia.com',
>     keep_alive  => 1,
>     test_url    => sub {
>         my $uri = shift;
>         return 0 if $uri->path =~ /ReviewClasses\.cfm/;
>         return 1;
>     }
>     #use_md5  => 1,
> );
> @servers = ( \%serverD, );
> 
> I'm not sure why this is getting stuck or how to debug for this issue
> I checked the -T trace flag options for indexing but nothing seems to
> pertain to this

That test_url function looks ok to me.  

You can do any tricks you like in test_url, of course:

my $seen_it;

    [...]

    test_url => sub {
        my $uri = shift;
        if ( $uri->path =~ /desc\.cfm$/ ) {
            return !$seen_it++;  # only index it once.
        }
        return 1;  # anything else is spidered as usual.
    },

-- 
Bill Moseley
moseley@hank.org

Received on Fri Oct 1 09:28:19 2004