Skip to main content.
home | support | download

Back to List Archive

Re: optimization for PhraseHighlight.pm

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Jul 07 2004 - 19:24:26 GMT
excellent! Thanks, Bill.

FYI, I am currently at work on a HTML::HiLiter perl module written with 
swish-e in mind. If there is anyone out there interested in testing it, 
please email me offline.

This module aims to support phrases, CSS, character entities, etc. It 
uses HTML::Parser and other subclasses of HTML::.

pek

Bill Schell wrote on 7/7/04 2:01 PM:

> This is a multi-part message in MIME format.
> --------------070605000707000702040506
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> Content-Transfer-Encoding: 7bit
> 
> Greetings,  fellow swishians!
> Or is it swishies? swishoids? swishers?  I think I like 
> swishians...sounds like
> the citizens of a tiny country, probably in eastern europe.
> 
> I have been using PhraseHighlight.pm (through swish.cgi and otherwise) 
> to highlight
> search terms in whole documents, rather than just small sections, and 
> found that for even
> medium size documents (tens of pages) that it was too slow.  I was 
> burning 9 seconds
> of CPU just to do the highlighting in my test 53000 word document.    
> This has impact not
> only for people doing things like I am, but for anyone who has high 
> settings for show_words
> and max_words in their swishcgi.conf file.
> 
> After a little use of the perl 'smallprof' profiler (counts times 
> individual lines are executed),
> I discovered that line 213 of PhraseHighlight.pm was being executed 47.9 
> million times
> for this document, accounting for over 90% of the CPU consumption.   It 
> turns out that
> this statement: "$flags[$_]++"  is executed: 
> (words in document  *  phrase_cont  * words_per_phrase   * 
> words_in_document)
> times.  
> 
> Looked like a prime optimization candidate.   The intention of the 
> $flags array is to flag
> words in the document that will later be output.  I have added a check 
> before the
> $flag setting code code  to determine if all the words in the document 
> are involved. 
> If so, it sets a $show_all_words flag once, and never sets anything in 
> $flag. 
> Later, the document outputting code looks at $show_all_words and outputs the
> whole document if it is set, otherwise it looks at $flags.   CPU usage 
> dropped from 9 seconds
> to less than 1 second (2Ghz x86 linux).
> 
> Modified PhraseHighlight.pm attached.  Diff listing follows (in case the 
> list strips off
> the attachment).
> 
> Bill Schell
> 
> $ diff PhraseHighlight.pm PhraseHighlight.pm.orig
> 101d100
> <     my $show_all_words = 0;
> 205,218c204,213
> <           my ($start, $stop);
> <           if (!$show_all_words) {
> <               $start = ($word_pos - $Show_Words + 1) * 2;
> <               $stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
> <               if ( $start < 0 ) {
> <                   $stop = $stop - $start;
> <                   $start = 0;
> <               }
> <
> <               $stop = $#words if $stop > $#words;
> <
> <               $show_all_words = 1 if ($start == 0 && $stop == $#words);
> <               if (!$show_all_words) { $flags[$_]++ for $start .. $stop; }
> <           }
> ---
>  >             my $start = ($word_pos - $Show_Words + 1) * 2;
>  >             my $stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
>  >             if ( $start < 0 ) {
>  >                 $stop = $stop - $start;
>  >                 $start = 0;
>  >             }
>  >
>  >             $stop = $#words if $stop > $#words;
>  >
>  >             $flags[$_]++ for $start .. $stop;
> 257c252
> <             if ( $show_all_words || $flags[$i] ) {
> ---
>  >             if ( $flags[$i] ) {
> 
> 
> 
> 
> 
> --------------070605000707000702040506
> Content-Type: text/plain;
>  name="PhraseHighlight.pm"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: inline;
>  filename="PhraseHighlight.pm"
> 
> #=======================================================================
> #  Phrase Highlighting Code
> #
> #    $Id: PhraseHighlight.pm,v 1.2 2003/05/20 00:56:20 whmoseley Exp $
> #=======================================================================
> package SWISH::PhraseHighlight;
> use strict;
> 
> use constant DEBUG_HIGHLIGHT => 0;
> 
> sub new {
>     my ( $class, $settings, $headers ) = @_;
> 
> 
> 
>     my $self = bless {
>         settings => $settings,
>         headers  => $headers,
>     }, $class;
> 
> 
> 
>     if ( $self->header('stemming applied') =~ /^(?:1|yes)$/i ) {
>         eval { require SWISH::Stemmer };
>         if ( $@ ) {
>             warn('Stemmed index needs Stemmer.pm to highlight: ' . $@);
>         } else {
>             $self->{stemmer_function} = \&SWISH::Stemmer::SwishStem;
>         }
>     }
> 
> 
>     $self->{stopwords} = { map { $_, 1 } split /\s+/, $self->header('stopwords') };
> 
> 
>     $self->set_match_regexp;
> 
> 
>     return $self;
> }
> 
> 
> sub header {
>     my $self = shift;
>     return '' unless ref $self->{headers} eq 'HASH';
>     return $self->{headers}{$_[0]} || '';
> }
> 
> 
> #=========================================================================
> # Highlight a single property -- returns true if any words highlighted
> # no, no returns true really means that the text was processed and most
> # importantly HTML escaped.
> 
> sub highlight {
> 
>     my ( $self, $text_ref, $phrase_list ) = @_;
> 
>     my $wc_regexp = $self->{wc_regexp};
>     my $extract_regexp = $self->{extract_regexp};
> 
> 
>     my $last = 0;
> 
>     my $found_phrase = 0;
> 
>     my $settings = $self->{settings};
> 
>     my $Show_Words = $settings->{show_words} || 10;
>     my $Occurrences = $settings->{occurrences} || 5;
>     my $Max_Words = $settings->{max_words} || 100;
> 
> 
> 
>     my $On = $settings->{highlight_on} || '<b>';
>     my $Off = $settings->{highlight_off} || '</b>';
> 
>     my $on_flag  = 'sw' . time . 'on';
>     my $off_flag = 'sw' . time . 'off';
> 
> 
>     my $stemmer_function = $self->{stemmer_function};
> 
>     # Should really call unescapeHTML(), but then would need to escape <b> from escaping.
> 
> 
> 
>     # Split into "swish" words.  For speed, should work on a stream method.
>     my @words = split /$wc_regexp/, $$text_ref;
>     return unless @words;
> 
>     my @flags;  # This marks where to start and stop display.
>     $flags[$#words] = 0;  # Extend array.
> 
>     my $occurrences = $Occurrences;
> 
> 
>     my $word_pos = $words[0] eq '' ? 2 : 0;  # Start depends on if first word was wordcharacters or not
> 
> 
>     my $show_all_words = 0;
> 
>     # Remember, that the swish words are every other in @words.
> 
>     WORD:
>     while ( $Show_Words && $word_pos * 2 < @words ) {
> 
>         PHRASE:
>         foreach my $phrase ( @$phrase_list ) {
> 
>             print STDERR "  Search phrase '@$phrase'\n" if DEBUG_HIGHLIGHT;
>             next PHRASE if ($word_pos + @$phrase -1) * 2 > @words;  # phrase is longer than what's left
> 
> 
>             my $end_pos = 0;  # end offset of the current phrase
> 
>             # now compare all the words in the phrase
> 
>             my ( $begin, $word, $end );
> 
>             for my $match_word ( @$phrase ) {
> 
>                 my $cur_word = $words[ ($word_pos + $end_pos) * 2 ];
>                 unless ( $cur_word =~ /$extract_regexp/ ) {
> 
>                     my $idx = ($word_pos + $end_pos) * 2;
>                     my ( $s, $e ) = ( $idx - 10, $idx + 10 );
>                     $s = 0 if $s < 0;
>                     $e = @words-1 if $e >= @words;
> 
>                     warn  "Failed to  IgnoreFirst/Last from word '"
>                     . (defined $cur_word ? $cur_word : '*undef')
>                     . "' (index: $idx) word_pos:$word_pos end_pos:$end_pos total:"
>                     . scalar @words
>                     . "\n-search pharse words-\n"
>                     . join( "\n", map { "$_ '$phrase->[$_]'" } 0..@$phrase -1 )
>                     . "\n-Words-\n"
>                     . join( "\n", map { "$_ '$words[$_]'" . ($_ == $idx ? ' <<< this word' : '') } $s..$e )
>                     . "\n";
> 
>                     next PHRASE;
>                 }
> 
> 
> 
> 
>                 # Strip ignorefirst and ignorelast
>                 ( $begin, $word, $end ) = ( $1, $2, $3 );  # this is a waste, as it can operate on the same word over and over
> 
>                 my $check_word = lc $word;
> 
>                 if ( $end_pos && exists $self->{stopwords}{$check_word} ) {
>                     $end_pos++;
>                     print STDERR " Found stopword '$check_word' in the middle of phrase - * MATCH *\n" if DEBUG_HIGHLIGHT;
>                     redo if  ( $word_pos + $end_pos ) * 2 < @words;  # go on to check this match word with the next word.
> 
>                     # No more words to match with, so go on to next pharse.
>                     next PHRASE;
>                 }
> 
>                 if ( $stemmer_function ) {
>                     my $w = $stemmer_function->($check_word);
>                     $check_word = $w if $w;
>                 }
> 
> 
> 
>                 print STDERR "     comparing source # (word:$word_pos offset:$end_pos) '$check_word' == '$match_word'\n" if DEBUG_HIGHLIGHT;
> 
>                 if ( substr( $match_word, -1 ) eq '*' ) {
>                     next PHRASE if index( $check_word, substr($match_word, 0, length( $match_word ) - 1) ) != 0;
> 
>                 } else {
>                     next PHRASE if $check_word ne $match_word;
>                 }
> 
> 
>                 print STDERR "      *** Word Matched '$check_word' *** \n" if DEBUG_HIGHLIGHT;
>                 $end_pos++;
>             }
> 
>             print STDERR "      *** PHRASE MATCHED (word:$word_pos offset:$end_pos) *** \n" if DEBUG_HIGHLIGHT;
> 
> 	    $found_phrase++;
> 
> 
>             # We are currently at the end word, so it's easy to set that highlight
> 
>             $end_pos--;
> 
>             if ( !$end_pos ) { # only one word
>                 $words[$word_pos * 2] = "$begin$on_flag$word$off_flag$end";
>             } else {
>                 $words[($word_pos + $end_pos) * 2 ] = "$begin$word$off_flag$end";
> 
>                 #Now, reload first word of match
>                 $words[$word_pos * 2] =~ /$extract_regexp/ or die "2 Why didn't '$words[$word_pos]' =~ /$extract_regexp/?";
>                 # Strip ignorefirst and ignorelast
>                 ( $begin, $word, $end ) = ( $1, $2, $3 );  # probably should cache this!
>                 $words[$word_pos * 2] = "$begin$on_flag$word$end";
>             }
> 
> 
>             # Now, flag the words around to be shown
> 	    my ($start, $stop);
> 	    if (!$show_all_words) {
> 		$start = ($word_pos - $Show_Words + 1) * 2;
> 		$stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
> 		if ( $start < 0 ) {
> 		    $stop = $stop - $start;
> 		    $start = 0;
> 		}
> 
> 		$stop = $#words if $stop > $#words;
> 
> 		$show_all_words = 1 if ($start == 0 && $stop == $#words);
> 		if (!$show_all_words) { $flags[$_]++ for $start .. $stop; }
> 	    }
> 
> 
>             # All done, and mark where to stop looking
>             if ( --$occurrences <= 0 ) {
>                 $last = $stop;
>                 last WORD;
>             }
> 
> 
>             # Now reset $word_pos to word following
>             $word_pos += $end_pos; # continue will still be executed
>             next WORD;
>         }
>     } continue {
>         $word_pos ++;
>     }
> 
> 
> 
> 
>     my $dotdotdot = ' ... ';
> 
> 
>     my @output;
> 
>     my $printing;
>     my $first = 1;
>     my $some_printed;
> 
>     if ( $Show_Words && @words > 50 ) {  # don't limit context if a small number of words
>         for my $i ( 0 ..$#words ) {
> 
> 
>             if ( $last && $i >= $last && $i < $#words ) {
>                 push @output, $dotdotdot;
>                 last;
>             }
> 
>             if ( $show_all_words || $flags[$i] ) {
> 
>                 push @output, $dotdotdot if !$printing++ && !$first;
>                 push @output, $words[$i];
>                 $some_printed++;
> 
>             } else {
>                 $printing = 0;
>             }
> 
> 	    $first = 0;
> 
> 
>         }
>     }
> 
>     if ( !$some_printed ) {
>         for my $i ( 0 .. $Max_Words ) {
>             if ( $i > $#words ) {
>                 $printing++;
>                 last;
>             }
>             push @output, $words[$i];
>         }
>     }
> 
> 
> 
>     push @output, $dotdotdot if !$printing;
> 
>     $$text_ref = join '', @output;
>     my %entities = (
>         '&' => '&amp;',
>         '>' => '&gt;',
>         '<' => '&lt;',
>         '"' => '&quot;',
>     );
>     my %highlight = (
>         $on_flag => $On,
>         $off_flag => $Off,
>     );
> 
> 
>     $$text_ref =~ s/([&"<>])/$entities{$1}/ge;  # " fix emacs
> 
>     $$text_ref =~ s/($on_flag|$off_flag)/$highlight{$1}/ge;
> 
>     return 1;  # Means that prop was processed AND was html escaped.
>     return $found_phrase;
> 
>     # $$text_ref = join '', @words;  # interesting that this seems reasonably faster
> 
> 
> 
> }
> 
> #============================================
> # Returns compiled regular expressions for matching
> #
> #
> 
> sub set_match_regexp {
>     my $self = shift;
> 
> 
> 
>     my $wc = $self->header('wordcharacters');
>     my $ignoref = $self->header('ignorefirstchar');
>     my $ignorel = $self->header('ignorelastchar');
> 
> 
>     $wc = quotemeta $wc;
> 
>     #Convert query into regular expressions
> 
> 
>     for ( $ignoref, $ignorel ) {
>         if ( $_ ) {
>             $_ = quotemeta;
>             $_ = "([$_]*)";
>         } else {
>             $_ = '()';
>         }
>     }
> 
> 
>     $wc .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';  # Warning: dependent on tolower used while indexing
> 
> 
>     # Now, wait a minute.  Look at this more, as I'd hope that making a
>     # qr// go out of scope would release the compiled pattern.
> 
>     if ( $ENV{MOD_PERL} ) {
>         $self->{wc_regexp}      = qr/([^$wc]+)/;                     # regexp for splitting into swish-words
>         $self->{extract_regexp} = qr/^$ignoref([$wc]+?)$ignorel$/i;  # regexp for extracting out the words to compare
> 
>      } else {
>         $self->{wc_regexp}      = qr/([^$wc]+)/o;                    # regexp for splitting into swish-words
>         $self->{extract_regexp} = qr/^$ignoref([$wc]+?)$ignorel$/oi;  # regexp for extracting out the words to compare
>      }
> }
> 
> 1;
> 
> 
> 
> 
> --------------070605000707000702040506--

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Jul 7 12:24:37 2004