Re: optimization for

From: Bill Moseley <moseley(at)>
Date: Tue Jul 20 2004 - 20:27:19 GMT
On Wed, Jul 07, 2004 at 12:00:46PM -0700, Bill Schell wrote:
> I have been using (through swish.cgi and
> otherwise) to highlight search terms in whole documents, rather than
> just small sections, and found that for even medium size documents
> (tens of pages) that it was too slow.  I was burning 9 seconds of
> CPU just to do the highlighting in my test 53000 word document.
> This has impact not only for people doing things like I am, but for
> anyone who has high settings for show_words and max_words in their
> swishcgi.conf file.

I'm still trying to understand this.  I understand your patch, but
what I'm wondering is when and why you would have a situation where
all words are flagged for display.

The highlighting code looks for a phrase match in a document.  Then,
for context, it flags not only the phrase for display, but also a few
words on each side (that's the "show_words" setting).  Normally that's
only a few words (like five).
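
To make the flagging step concrete, here is a rough sketch in Python.
It is not the module's actual code (that is Perl), and the names
flag_context, words, phrase and flags are made up for illustration;
only "show_words" comes from the config.  It assumes the document is
already split into a list of words.

```python
def flag_context(words, phrase, show_words=5):
    """Return one boolean per word: True if the word should be displayed.

    Hypothetical helper, not the real highlighting module.
    words      -- the document as a list of words
    phrase     -- the search phrase as a list of words
    show_words -- how many context words to show on each side of a match
    """
    flags = [False] * len(words)
    n = len(phrase)
    if n == 0:
        return flags

    # Compare case-insensitively (an assumption for this sketch).
    lowered = [w.lower() for w in words]
    target = [w.lower() for w in phrase]

    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == target:
            # Flag the phrase itself plus show_words on each side,
            # clamped to the document boundaries.
            start = max(0, i - show_words)
            end = min(len(words), i + n + show_words)
            for j in range(start, end):
                flags[j] = True
    return flags
```

With a small show_words only the words near each match get flagged.
With a show_words larger than the document, a single match flags every
word, which seems to be the case the patch is aimed at.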

> Looked like a prime optimization candidate.   The intention of the
> $flags array is to flag words in the document that will later be
> output.  I have added a check before the $flag setting code to
> determine if all the words in the document are involved.  If so, it
> sets a $show_all_words flag once, and never sets anything in $flag.

So for that to happen you must either have "show_words" set very
large, or have a very small document (in which case setting @flags
word-by-word wouldn't be much of a performance hit).

If you have "show_words" so large that the *entire* document will
show, then you might want to use a different module that just uses
regexps to flag the phrases and then outputs the entire doc.  That
would likely be much, much faster.  The trick (or trade-off) is
figuring out regular expressions that work well enough.
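
As a rough idea of what the regexp approach might look like, here is a
minimal Python sketch.  It is not an existing module: the function name
highlight_phrases and the open_tag/close_tag parameters are invented
for the example.  It assumes phrases are plain words separated by
whitespace, and it does no HTML escaping.

```python
import re


def highlight_phrases(text, phrases, open_tag="<b>", close_tag="</b>"):
    """Wrap every occurrence of each phrase in text with the given tags.

    Hypothetical helper: one compiled pattern, one pass over the text,
    and the whole document is returned (no per-word flags).
    """
    parts = []
    for phrase in phrases:
        tokens = phrase.split()
        if tokens:
            # Escape each word, and let any run of whitespace
            # (including newlines) separate the words of a phrase.
            parts.append(r"\s+".join(re.escape(t) for t in tokens))
    if not parts:
        return text

    # Try longer alternatives first so a long phrase is preferred
    # over a shorter one starting at the same position.
    parts.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)
```

The trade-off shows up right away: the \b anchors stop "fox" from
matching inside "foxes", but this does nothing about stemming,
punctuation inside a phrase, or markup already present in the text.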

Or am I missing something?  I guess I didn't expect a very large
setting for "show_words".

Someone once sent me a "better" regular expression highlighting
module, which sat in my inbox until I lost it in a clever move.  I'll
see if I can't dig that up again.

Bill Moseley

Received on Tue Jul 20 13:27:39 2004