Re: swish-e 2.1 hangs for a very long time

From: Michael <michael(at)not-real.insulin-pumpers.org>
Date: Sat Jul 13 2002 - 21:20:29 GMT
> On Sat, 2002-07-13 at 12:31, Michael wrote:
> > >   perl swishspider ./testing http://members.aol.com/CamelsRFun/
> > > Does this take 5-10 minutes and use 100% CPU?
> > 22725 diabetes  19   0  4484 4484  1552 R    99.3  1.8   0:28 perl
> > ~5-6 minutes later
> > 22725 diabetes  18   0  4496 4496  1552 R    98.2  1.8   5:09 perl
> > ~ 7-8 minutes
> > 22725 diabetes  17   0  4496 4496  1552 R    98.6  1.8   7:03 perl
> 
> Well, I'm not sure what the problem is.  But, whatever is wrong is
> related to PERL and/or the swishspider script.  Seems like it would
> be a problem with the PERL interpreter.  But, I'm not sure.
> 

I've found the problem.

My version of HTML::Parser::parse uses the construct

$$buf =~ s/(something)//;
to search recursively through the 40k contents of the buffer. This is 
where all the time is going -- oodles of it. The code is not 
particularly efficient, and I suspect that if the substitution were 
dropped and something like 
study $$buf;
$$buf =~ m/\G$pattern/gc, pos(...).... 
were used instead, it would be substantially faster.
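
To illustrate the idea, here is a rough sketch of scanning a buffer 
with \G and /gc instead of chopping pieces off the front with s///. 
The tokenize() routine and its two patterns are made up for this 
example and are not taken from HTML::Parser; the point is only that 
pos() advances through the buffer while the buffer itself is never 
modified.

```perl
use strict;
use warnings;

# Hypothetical tokenizer for illustration only; not HTML::Parser code.
sub tokenize {
    my ($buf) = @_;              # reference to the buffer, never modified
    my @tokens;
    pos($$buf) = 0;              # start scanning at the beginning
    while (pos($$buf) < length $$buf) {
        if ($$buf =~ m/\G(<[^>]*>)/gc) {       # a complete tag
            push @tokens, [ tag => $1 ];
        }
        elsif ($$buf =~ m/\G([^<]+)/gc) {      # a run of plain text
            push @tokens, [ text => $1 ];
        }
        else {
            # A '<' with no closing '>': keep the remainder as text and stop.
            push @tokens, [ text => substr($$buf, pos($$buf)) ];
            last;
        }
    }
    return \@tokens;
}

my $html   = '<p>Hello <b>world</b></p>';
my $tokens = tokenize(\$html);
print join(' | ', map { "$_->[0]: $_->[1]" } @$tokens), "\n";
```

The /c modifier keeps pos() where it was when a match fails, so the 
next alternative can be tried from the same offset. Whether this is 
actually faster than the s/// approach on a given perl would need to 
be measured.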

############ time passes and this response sits on the desktop

I've checked CPAN and the current version of 
HTML::Parser is 3.26; I've got 2.22 loaded, sigh......

Upgrading has done the trick; it is very fast now.
Some mention of the various revision levels, or changing the use 
statement in swishspider to require a current version, would help a 
little :-)  My LWP is up to date, as are most other modules, just not 
HTML::Parser. sigh......
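
For what it's worth, perl can enforce a minimum module version 
itself: "use HTML::Parser 3.26;" dies at compile time if the 
installed module is older. Below is a rough sketch of a friendlier 
runtime check. The have_version() helper is made up for this example, 
and 3.26 is simply the version that worked here; the real minimum 
may well be lower.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $module loads and is at least version $min.
sub have_version {
    my ($module, $min) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    # UNIVERSAL::VERSION dies if the loaded module is older than $min.
    return eval { require $file; $module->VERSION($min); 1 } ? 1 : 0;
}

# 3.26 is an assumption: it is only the version that fixed things for me.
if (have_version('HTML::Parser', 3.26)) {
    print "HTML::Parser ", HTML::Parser->VERSION, " is recent enough\n";
}
else {
    print "HTML::Parser 3.26 or later needed: $@";
}
```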

Michael



> You might try using the prog method with spider.pl.  In the prog-bin
> directory are some example scripts.  Grab spider.pl and make sure
> the #! line points to perl.
> 
> Then try this:
>   ./spider.pl default http://members.aol.com/CamelsRFun/
> 
> You should see each document printed to your terminal.  If that
> script has problems then I'd be looking around to see if there are
> problems with the PERL interpreter.

> 
> 
> $ cat c
> IndexDir ./spider.pl
> SwishProgParameters default http://www.insulin-pumpers.org/
> IndexFile ./swish.index
> IndexName "Insulin Pumpers Mail Archive"
> IndexDescription "no other index was specified." 
> IndexPointer "www.insulin-pumpers.org"
> IndexAdmin "webmaster@insulin-pumpers.org"
> MetaNames author description datamodified
> IndexReport 3
> UseStemming yes
> PropertyNames author description datamodified
> IgnoreTotalWordCountWhenRanking yes
> MinWordLimit 4
> WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_'"
> IgnoreLimit 80 1000
> IndexComments 0
> TmpDir ./
> 
> 
> $ ./swish-e -c c -S prog -v3
> Parsing config file 'c'
> Indexing Data Source: "External-Program"
> Indexing "./spider.pl"
> ./spider.pl: Reading parameters from 'default'
> http://www.insulin-pumpers.org/ - Using DEFAULT (HTML) parser - 
> (340 words) ...
> 
> -- 
>  David Norris
>   Dave's Web - http://www.webaugur.com/dave/
>   Augury Net - http://augur.homeip.net/
>   ICQ - 412039
> 
Received on Sat Jul 13 21:24:04 2002