On Mon, Dec 15, 2003 at 08:26:03AM -0800, Frank Naude wrote:
> Hi,
>
> Does anybody have a script to highlight/colorize search phrases in HTML
> documents? HTML tags should be left intact. The script should be similar
> to Google's cached pages functionality.

You might poke around the swish-e archives, as I think this has come up
before. I think someone used a regular expression to highlight. How
accurate you want to be will determine how complex the code needs to be.

I've done it by using HTML::Parser to parse the entire page into an
array with one entry per word, tag, or whatever other unit I had,
flagging each entry as visible text or markup. Then I built a secondary
array of just the text and used that to flag which words and phrases
needed to be highlighted. This took into consideration the
WordCharacters setting in swish-e, stopwords, metanames, and stemming.
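
A rough sketch of that two-array idea, written in Python with the
standard html.parser module standing in for HTML::Parser. This is an
illustration under assumptions, not the code described above: the names
Tokenizer and highlight_phrase are made up, words are simply runs of
\w characters, and WordCharacters, stopwords, metanames, stemming and
script/style handling are all left out.

```python
import re
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Split a page into units, each flagged as a visible word or not."""

    def __init__(self):
        # Keep entities as written so the page can be rebuilt unchanged.
        super().__init__(convert_charrefs=False)
        self.units = []  # entries are [text, is_word]

    def handle_starttag(self, tag, attrs):
        self.units.append([self.get_starttag_text(), False])

    def handle_startendtag(self, tag, attrs):
        self.units.append([self.get_starttag_text(), False])

    def handle_endtag(self, tag):
        self.units.append(["</%s>" % tag, False])

    def handle_entityref(self, name):
        self.units.append(["&%s;" % name, False])

    def handle_charref(self, name):
        self.units.append(["&#%s;" % name, False])

    def handle_data(self, data):
        # re.split with a capture group keeps the words in the result.
        for piece in re.split(r"(\w+)", data):
            if piece:
                is_word = re.fullmatch(r"\w+", piece) is not None
                self.units.append([piece, is_word])


def highlight_phrase(html, phrase):
    """Wrap each word of every occurrence of phrase in <b>...</b>."""
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    units = tokenizer.units

    # Secondary array: positions of the visible words only.
    words = [i for i, (_, is_word) in enumerate(units) if is_word]
    target = phrase.lower().split()

    hits = set()
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [units[i][0].lower() for i in window] == target:
            hits.update(window)

    return "".join(
        "<b>%s</b>" % text if i in hits else text
        for i, (text, _) in enumerate(units)
    )
```

Because matching runs over the words-only array, a phrase is found even
when tags sit between its words, and wrapping each word separately keeps
the output properly nested.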

It can be slow, as you might imagine. Google likely has the cached
document pre-parsed, and I don't believe they highlight phrases at all.
Phrase highlighting adds extra processing (phrases can span HTML tags,
for example). So for Google-like highlighting I'd guess you could just
run the page through HTML::Parser or HTML::TreeBuilder and then use a
regexp on the text nodes.
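
For the simpler Google-like case, something along these lines might do.
Again this is a Python sketch with html.parser in place of HTML::Parser;
the Highlighter class, the highlight helper and the "hit" class name are
all made up for the example, and it assumes a non-empty list of single
search words.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Echo the page, wrapping search terms found in visible text."""

    SKIP = {"script", "style"}  # contents are passed through untouched

    def __init__(self, terms):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0
        pattern = "|".join(re.escape(term) for term in terms)
        self.regex = re.compile(r"\b(" + pattern + r")\b", re.IGNORECASE)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        # Raw tag text, so attributes are never searched or rewritten.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)  # tag name comes back lowercased

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)
        else:
            self.out.append(self.regex.sub(r'<b class="hit">\1</b>', data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight(html, terms):
    parser = Highlighter(terms)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Only text nodes go through the regexp, so a search word that also shows
up in an attribute value or inside a script block is left alone. A
phrase split across tags would not match here, which is the trade-off
for the simpler approach.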
--
Bill Moseley
moseley@hank.org
Received on Mon Dec 15 17:01:56 2003