Re: Highlight search results in source documents

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Dec 15 2003 - 17:01:49 GMT
On Mon, Dec 15, 2003 at 08:26:03AM -0800, Frank Naude wrote:
> Hi,
> 
> Does anybody have a script to highlight/colorize search phrases in HTML
> documents? HTML tags should be left intact. The script should be similar
> to Google's cached pages functionality.

You might poke around the swish-e archives, as I think this has come up 
before.  I think someone used a regular expression to highlight.  How 
accurate you want to be will determine how complex the code needs to be.

I've done it using HTML::Parser to split the entire page into an array 
with one element per word, tag, or whatever other unit I had, flagging 
each element as visible text or markup.  Then I built a secondary array 
of just the text and used that to flag which words and phrases need to 
be highlighted.  This took into account the WordCharacters settings in 
swish-e, stopwords, metanames, and stemming.
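
A minimal sketch of that token-array idea, written in Python with the 
standard html.parser module as a stand-in for Perl's HTML::Parser.  It is 
not the original code.  The names Tokenizer, WORD and highlight_phrase are 
made up for illustration, \w+ is only an assumed substitute for a real 
WordCharacters setting, and stopwords, metanames, stemming and 
script/style handling are all left out.

```python
import re
from html.parser import HTMLParser

# Assumption: \w+ stands in for a swish-e WordCharacters setting.
WORD = re.compile(r"\w+")


class Tokenizer(HTMLParser):
    """Split a page into [text, is_word] units, keeping markup verbatim."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tokens = []

    def _markup(self, text):
        self.tokens.append([text, False])

    def handle_starttag(self, tag, attrs):
        self._markup(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._markup(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._markup("</%s>" % tag)

    def handle_entityref(self, name):
        self._markup("&%s;" % name)

    def handle_charref(self, name):
        self._markup("&#%s;" % name)

    def handle_comment(self, data):
        self._markup("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._markup("<!%s>" % decl)

    def handle_data(self, data):
        # Break visible text into word tokens and the filler between them.
        pos = 0
        for m in WORD.finditer(data):
            if m.start() > pos:
                self.tokens.append([data[pos:m.start()], False])
            self.tokens.append([m.group(), True])
            pos = m.end()
        if pos < len(data):
            self.tokens.append([data[pos:], False])


def highlight_phrase(page, phrase):
    """Wrap each word of a matched phrase in <b>, even across tags."""
    parser = Tokenizer()
    parser.feed(page)
    parser.close()
    tokens = parser.tokens

    # Secondary array: positions of the word tokens only.
    word_idx = [i for i, (_, is_word) in enumerate(tokens) if is_word]
    words = [tokens[i][0].lower() for i in word_idx]
    wanted = [w.lower() for w in WORD.findall(phrase)]

    hits = set()
    if wanted:
        for start in range(len(words) - len(wanted) + 1):
            if words[start:start + len(wanted)] == wanted:
                hits.update(word_idx[start:start + len(wanted)])

    return "".join(
        "<b>%s</b>" % text if i in hits else text
        for i, (text, _) in enumerate(tokens)
    )
```

Each matched word is wrapped on its own rather than wrapping the whole 
phrase, so a phrase that straddles a tag boundary cannot produce 
unbalanced markup.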

It can be slow, as you might imagine.  Google likely has the cached 
document pre-parsed, and I don't believe they highlight phrases at all.  
Phrase highlighting adds some extra processing (phrases can span HTML 
tags, for example).  So for Google-like highlighting I'd guess you could 
just run the page through HTML::Parser or HTML::TreeBuilder and then use 
a regexp on the text nodes.
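
The simpler approach, a regexp applied only to text nodes, might look 
roughly like the sketch below.  It uses Python's standard html.parser 
instead of HTML::Parser or HTML::TreeBuilder, and the names 
TermHighlighter and highlight_terms, along with the choice of <b> as the 
highlight tag, are assumptions made for the example.  It matches single 
terms only, so a phrase that spans a tag would not be found.

```python
import re
from html.parser import HTMLParser


class TermHighlighter(HTMLParser):
    """Re-emit the page, wrapping matched terms found in text nodes."""

    # Text inside these elements is not visible body text.
    SKIP = {"script", "style", "title"}

    def __init__(self, terms):
        super().__init__(convert_charrefs=False)
        # Longest terms first so "swish-e" wins over "swish".
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        self.pattern = re.compile(r"\b(%s)\b" % alternation, re.IGNORECASE)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)
        else:
            # Tags never reach this method, so attributes stay intact.
            self.out.append(self.pattern.sub(r"<b>\1</b>", data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight_terms(page, terms):
    """Return the page with every term highlighted in visible text."""
    terms = [t for t in terms if t]
    if not terms:
        return page
    parser = TermHighlighter(terms)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)
```

For example, highlight_terms('<p class="swish">Swish-e</p>', ["swish"]) 
leaves the class attribute alone and only wraps the visible "Swish".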

-- 
Bill Moseley
moseley@hank.org
Received on Mon Dec 15 17:01:56 2003