> On Sat, 2002-07-13 at 12:31, Michael wrote:
> > > perl swishspider ./testing http://members.aol.com/CamelsRFun/
> > > Does this take 5-10 minutes and use 100% CPU?
> > 22725 diabetes 19 0 4484 4484 1552 R 99.3 1.8 0:28 perl
> > ~5-6 minutes later
> > 22725 diabetes 18 0 4496 4496 1552 R 98.2 1.8 5:09 perl
> > ~ 7-8 minutes
> > 22725 diabetes 17 0 4496 4496 1552 R 98.6 1.8 7:03 perl
>
> Well, I'm not sure what the problem is. But, whatever is wrong is
> related to PERL and/or the swishspider script. Seems like it would
> be a problem with the PERL interpreter. But, I'm not sure.
>
I've found the problem.
My version of HTML::Parser::parse uses the construct
    $$buf =~ s/(something)//;
to search recursively through the 40k contents of the buffer. This is
where all the time is going, oodles of it. The code is not
particularly efficient, and I suspect that if the substitution were
not done and
    study $$buf;
    $$buf =~ m/\G$pattern/gc, pos(...).... were used
instead, it would be substantially faster.
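
A minimal sketch of the two styles, for illustration only. It is not the HTML::Parser code: the buffer contents, the \w+ pattern, and the names tokens_by_substitution and tokens_by_position are all made up here, and study is left out for brevity. The first routine chops matched text off the front of the buffer with s///. The second leaves the buffer alone and walks it with \G and /gc, resuming each match at pos().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the parser's buffer (not real HTML::Parser data).
my $text = join ' ', map { "word$_" } 1 .. 2000;

# Style 1: consume the buffer by substitution. Every successful s///
# removes the matched prefix, so the buffer is rewritten for each token.
sub tokens_by_substitution {
    my ($buf) = @_;                      # reference to the buffer string
    my @tokens;
    while (length $$buf) {
        if ($$buf =~ s/^(\w+)//) { push @tokens, $1 }
        else                     { $$buf =~ s/^.//s }   # drop one other char
    }
    return \@tokens;
}

# Style 2: leave the buffer untouched and scan it with \G and /gc.
# \G anchors at pos($$buf); /c keeps pos() in place when a match fails.
sub tokens_by_position {
    my ($buf) = @_;
    my @tokens;
    pos($$buf) = 0;
    while (pos($$buf) < length $$buf) {
        if ($$buf =~ m/\G(\w+)/gc) { push @tokens, $1 }
        else                       { $$buf =~ m/\G./gcs or last }  # skip one char
    }
    return \@tokens;
}

my $copy     = $text;                    # style 1 destroys its input
my $by_subst = tokens_by_substitution(\$copy);
my $by_pos   = tokens_by_position(\$text);
print scalar(@$by_subst), " tokens vs ", scalar(@$by_pos), " tokens\n";
```

Both routines should return the same token list. How much faster the second one is will depend on the Perl version and the pattern, so it is worth timing on the real data before drawing conclusions.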
############ time passes and this response sits on the desktop
I've checked CPAN and the current version of
HTML::Parser is 3.26; I've got 2.22 loaded, sigh......
Upgrading has done the trick, very fast.
Some mention of the required revision levels, or changing the "use"
statement in swishspider to require a current version, would help a
little :-) My LWP is up to date, as are most other modules, just not
HTML::Parser. sigh......
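
A rough sketch of such a version check, under assumptions: have_min_version is a hypothetical helper, not something in swishspider, and 3.26 is simply the CPAN version mentioned above, so the true minimum may well be lower. The compile-time equivalent would be a line such as "use HTML::Parser 3.26;" near the top of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if $module can be loaded and is at least version $min, else 0.
# Hypothetical helper for illustration; not part of swishspider.
sub have_min_version {
    my ($module, $min) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # HTML::Parser -> HTML/Parser.pm
    # VERSION($min) dies if the loaded module is older than $min.
    return eval { require $file; $module->VERSION($min); 1 } ? 1 : 0;
}

# 3.26 is an assumed minimum, taken from the CPAN version noted above.
if (have_min_version('HTML::Parser', '3.26')) {
    print "HTML::Parser ", HTML::Parser->VERSION, " looks recent enough\n";
}
else {
    print "HTML::Parser 3.26 or later not found: $@";
}
```

With a check like this the script complains at startup instead of spinning at 100% CPU on an old parser.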
Michael
> You might try using the prog method with spider.pl. In the prog-bin
> directory are some example scripts. Grab spider.pl and make sure
> the #! line points to perl.
>
> Then try this:
> ./spider.pl default http://members.aol.com/CamelsRFun/
>
> You should see each document printed to your terminal. If that
> script has problems then I'd be looking around to see if there are
> problems with the PERL interpreter.
>
>
> $ cat c
> IndexDir ./spider.pl
> SwishProgParameters default http://www.insulin-pumpers.org/
> IndexFile ./swish.index
> IndexName "Insulin Pumpers Mail Archive"
> IndexDescription "no other index was specified."
> IndexPointer "www.insulin-pumpers.org"
> IndexAdmin "webmaster@insulin-pumpers.org"
> MetaNames author description datamodified
> IndexReport 3
> UseStemming yes
> PropertyNames author description datamodified
> IgnoreTotalWordCountWhenRanking yes
> MinWordLimit 4
> WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_'"
> IgnoreLimit 80 1000
> IndexComments 0
> TmpDir ./
>
>
> $ ./swish-e -c c -S prog -v3
> Parsing config file 'c'
> Indexing Data Source: "External-Program"
> Indexing "./spider.pl"
> ./spider.pl: Reading parameters from 'default'
> http://www.insulin-pumpers.org/ - Using DEFAULT (HTML) parser -
> (340 words) ...
>
> --
> David Norris
> Dave's Web - http://www.webaugur.com/dave/
> Augury Net - http://augur.homeip.net/
> ICQ - 412039
>
Received on Sat Jul 13 21:24:04 2002