On Fri, Oct 01, 2004 at 08:50:33AM -0700, Jon Sorensen wrote:
> I'm trying to spider a number of sites but spider.pl keeps getting in a loop
> at:
>
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2310&RegID= - Using HTML2 parser - (582 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2311&RegID= - Using HTML2 parser - (591 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2312&RegID= - Using HTML2 parser - (582 words)
> https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2313&RegID= - Using HTML2 parser - (582 words)
Looks like only the ClassID changes, right? If so, you could strip
that parameter out of the URL in a test_url function. man URI
describes how to deal with the URI object that is passed in. I think
there are some examples in the list archives.
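
A minimal sketch of stripping one parameter with the URI module's
query_form method. The helper name strip_param is hypothetical, and
the callback is invoked by hand here rather than by spider.pl; it
assumes the spider goes on to use the URI object after the callback
has modified it.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: remove one query parameter from a URI object, in place.
sub strip_param {
    my ( $uri, $name ) = @_;
    my @pairs = $uri->query_form;    # flat list: key, value, key, value, ...
    my @kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        push @kept, $key, $value unless $key eq $name;
    }
    $uri->query_form( \@kept );      # rebuild the query string from what is left
    return $uri;
}

# A test_url callback in the style of the config quoted in this thread.
my $test_url = sub {
    my $uri = shift;
    return 0 if $uri->path =~ /ReviewClasses\.cfm/;
    strip_param( $uri, 'ClassID' ) if $uri->path =~ /desc\.cfm$/;
    return 1;
};

# Call the callback directly to see what it does to one of the looping URLs.
my $uri = URI->new(
    'https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2310&RegID='
);
my $ok = $test_url->($uri);
print "keep=$ok url=$uri\n";
```

Note that with ClassID removed, every desc.cfm URL in the same CatID
collapses to one URL, so only one of those pages would be fetched.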
> it was getting stuck on ReviewClasses.cfm so I'm using test_url to stop that
> but I want to index the desc.cfm pages, I'm planning on trying use_md5 but
> not sure if that
> will make any difference
It should make a difference if the content is the same for each
request. But removing the ClassID should solve the problem, too (if
that's what is making the URL different each time).
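
To illustrate the idea behind checksum-based duplicate detection,
here is a rough sketch using Digest::MD5. It is not spider.pl's own
code; the helper is_new_content, the %seen_digest hash, and the page
bodies are made up for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen_digest;    # digest => first URL that produced that content

# Hypothetical helper: true only the first time a given body is seen.
sub is_new_content {
    my ( $url, $content ) = @_;
    my $digest = md5_hex($content);
    return 0 if exists $seen_digest{$digest};
    $seen_digest{$digest} = $url;
    return 1;
}

# Made-up page bodies: the first two are byte-identical, the third is not.
for my $page (
    [ 'desc.cfm?ClassID=2310', '<html>same page</html>' ],
    [ 'desc.cfm?ClassID=2311', '<html>same page</html>' ],
    [ 'desc.cfm?ClassID=2312', '<html>another page</html>' ],
) {
    my ( $url, $content ) = @$page;
    my $action = is_new_content( $url, $content ) ? 'index' : 'skip';
    print "$action $url\n";
}
```

A checksum only matches when the content is byte-for-byte identical.
The output quoted above shows both 582 and 591 words, so at least
some of those pages differ and would not be treated as duplicates.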
> my %serverD = (
>     base_url   => 'https://secure.meriter.com/classreg/',
>     email      => 'jon@starkmedia.com',
>     keep_alive => 1,
>     test_url   => sub {
>         my $uri = shift;
>         return 0 if $uri->path =~ /ReviewClasses\.cfm/;
>         return 1;
>     }
>     #use_md5 => 1,
> );
> @servers = ( \%serverD, );
>
> I'm not sure why this is getting stuck or how to debug for this issue
> I checked the -T trace flag options for indexing but nothing seems to
> pertain to this
That test_url function looks ok to me.
You can do any tricks you like in test_url, of course:
    my $seen_it;
    [...]
    test_url => sub {
        my $uri = shift;
        if ( $uri->path =~ /desc\.cfm$/ ) {
            return !$seen_it++;    # only index it once.
        }
        return 1;                  # allow every other URL
    },
--
Bill Moseley
moseley@hank.org
Received on Fri Oct 1 09:28:19 2004