Skip to main content.
home | support | download

Back to List Archive

optimization for PhraseHighlight.pm

From: Bill Schell <friedfish(at)not-real.optonline.net>
Date: Wed Jul 07 2004 - 19:01:15 GMT
This is a multi-part message in MIME format.
--------------070605000707000702040506
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Greetings,  fellow swishians!
Or is it swishies? swishoids? swishers?  I think I like 
swishians...sounds like
the citizens of a tiny country, probably in eastern europe.

I have been using PhraseHighlight.pm (through swish.cgi and otherwise) 
to highlight
search terms in whole documents, rather than just small sections, and 
found that for even
medium size documents (tens of pages) that it was too slow.  I was 
burning 9 seconds
of CPU just to do the highlighting in my test 53000 word document.    
This has impact not
only for people doing things like I am, but for anyone who has high 
settings for show_words
and max_words in their swishcgi.conf file.

After a little use of the perl 'smallprof' profiler (counts times 
individual lines are executed),
I discovered that line 213 of PhraseHighlight.pm was being executed 47.9 
million times
for this document, accounting for over 90% of the CPU consumption.   It 
turns out that
this statement: "$flags[$_]++"  is executed: 
(words in document  *  phrase_cont  * words_per_phrase   * 
words_in_document)
times.  

Looked like a prime optimization candidate.   The intention of the 
$flags array is to flag
words in the document that will later be output.  I have added a check 
before the
$flag setting code code  to determine if all the words in the document 
are involved. 
If so, it sets a $show_all_words flag once, and never sets anything in 
$flag. 
Later, the document outputting code looks at $show_all_words and outputs the
whole document if it is set, otherwise it looks at $flags.   CPU usage 
dropped from 9 seconds
to less than 1 second (2Ghz x86 linux).

Modified PhraseHighlight.pm attached.  Diff listing follows (in case the 
list strips off
the attachment).

Bill Schell

$ diff PhraseHighlight.pm PhraseHighlight.pm.orig
101d100
<     my $show_all_words = 0;
205,218c204,213
<           my ($start, $stop);
<           if (!$show_all_words) {
<               $start = ($word_pos - $Show_Words + 1) * 2;
<               $stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
<               if ( $start < 0 ) {
<                   $stop = $stop - $start;
<                   $start = 0;
<               }
<
<               $stop = $#words if $stop > $#words;
<
<               $show_all_words = 1 if ($start == 0 && $stop == $#words);
<               if (!$show_all_words) { $flags[$_]++ for $start .. $stop; }
<           }
---
 >             my $start = ($word_pos - $Show_Words + 1) * 2;
 >             my $stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
 >             if ( $start < 0 ) {
 >                 $stop = $stop - $start;
 >                 $start = 0;
 >             }
 >
 >             $stop = $#words if $stop > $#words;
 >
 >             $flags[$_]++ for $start .. $stop;
257c252
<             if ( $show_all_words || $flags[$i] ) {
---
 >             if ( $flags[$i] ) {





--------------070605000707000702040506
Content-Type: text/plain;
 name="PhraseHighlight.pm"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="PhraseHighlight.pm"

#=======================================================================
#  Phrase Highlighting Code
#
#    $Id: PhraseHighlight.pm,v 1.2 2003/05/20 00:56:20 whmoseley Exp $
#=======================================================================
package SWISH::PhraseHighlight;
use strict;

use constant DEBUG_HIGHLIGHT => 0;

sub new {
    my ( $class, $settings, $headers ) = @_;



    my $self = bless {
        settings => $settings,
        headers  => $headers,
    }, $class;



    if ( $self->header('stemming applied') =~ /^(?:1|yes)$/i ) {
        eval { require SWISH::Stemmer };
        if ( $@ ) {
            warn('Stemmed index needs Stemmer.pm to highlight: ' . $@);
        } else {
            $self->{stemmer_function} = \&SWISH::Stemmer::SwishStem;
        }
    }


    $self->{stopwords} = { map { $_, 1 } split /\s+/, $self->header('stopwords') };


    $self->set_match_regexp;


    return $self;
}


sub header {
    my $self = shift;
    return '' unless ref $self->{headers} eq 'HASH';
    return $self->{headers}{$_[0]} || '';
}


#=========================================================================
# Highlight a single property -- returns true if any words highlighted
# no, no returns true really means that the text was processed and most
# importantly HTML escaped.

sub highlight {

    my ( $self, $text_ref, $phrase_list ) = @_;

    my $wc_regexp = $self->{wc_regexp};
    my $extract_regexp = $self->{extract_regexp};


    my $last = 0;

    my $found_phrase = 0;

    my $settings = $self->{settings};

    my $Show_Words = $settings->{show_words} || 10;
    my $Occurrences = $settings->{occurrences} || 5;
    my $Max_Words = $settings->{max_words} || 100;



    my $On = $settings->{highlight_on} || '<b>';
    my $Off = $settings->{highlight_off} || '</b>';

    my $on_flag  = 'sw' . time . 'on';
    my $off_flag = 'sw' . time . 'off';


    my $stemmer_function = $self->{stemmer_function};

    # Should really call unescapeHTML(), but then would need to escape <b> from escaping.



    # Split into "swish" words.  For speed, should work on a stream method.
    my @words = split /$wc_regexp/, $$text_ref;
    return unless @words;

    my @flags;  # This marks where to start and stop display.
    $flags[$#words] = 0;  # Extend array.

    my $occurrences = $Occurrences;


    my $word_pos = $words[0] eq '' ? 2 : 0;  # Start depends on if first word was wordcharacters or not


    my $show_all_words = 0;

    # Remember, that the swish words are every other in @words.

    WORD:
    while ( $Show_Words && $word_pos * 2 < @words ) {

        PHRASE:
        foreach my $phrase ( @$phrase_list ) {

            print STDERR "  Search phrase '@$phrase'\n" if DEBUG_HIGHLIGHT;
            next PHRASE if ($word_pos + @$phrase -1) * 2 > @words;  # phrase is longer than what's left


            my $end_pos = 0;  # end offset of the current phrase

            # now compare all the words in the phrase

            my ( $begin, $word, $end );

            for my $match_word ( @$phrase ) {

                my $cur_word = $words[ ($word_pos + $end_pos) * 2 ];
                unless ( $cur_word =~ /$extract_regexp/ ) {

                    my $idx = ($word_pos + $end_pos) * 2;
                    my ( $s, $e ) = ( $idx - 10, $idx + 10 );
                    $s = 0 if $s < 0;
                    $e = @words-1 if $e >= @words;

                    warn  "Failed to  IgnoreFirst/Last from word '"
                    . (defined $cur_word ? $cur_word : '*undef')
                    . "' (index: $idx) word_pos:$word_pos end_pos:$end_pos total:"
                    . scalar @words
                    . "\n-search pharse words-\n"
                    . join( "\n", map { "$_ '$phrase->[$_]'" } 0..@$phrase -1 )
                    . "\n-Words-\n"
                    . join( "\n", map { "$_ '$words[$_]'" . ($_ == $idx ? ' <<< this word' : '') } $s..$e )
                    . "\n";

                    next PHRASE;
                }




                # Strip ignorefirst and ignorelast
                ( $begin, $word, $end ) = ( $1, $2, $3 );  # this is a waste, as it can operate on the same word over and over

                my $check_word = lc $word;

                if ( $end_pos && exists $self->{stopwords}{$check_word} ) {
                    $end_pos++;
                    print STDERR " Found stopword '$check_word' in the middle of phrase - * MATCH *\n" if DEBUG_HIGHLIGHT;
                    redo if  ( $word_pos + $end_pos ) * 2 < @words;  # go on to check this match word with the next word.

                    # No more words to match with, so go on to next pharse.
                    next PHRASE;
                }

                if ( $stemmer_function ) {
                    my $w = $stemmer_function->($check_word);
                    $check_word = $w if $w;
                }



                print STDERR "     comparing source # (word:$word_pos offset:$end_pos) '$check_word' == '$match_word'\n" if DEBUG_HIGHLIGHT;

                if ( substr( $match_word, -1 ) eq '*' ) {
                    next PHRASE if index( $check_word, substr($match_word, 0, length( $match_word ) - 1) ) != 0;

                } else {
                    next PHRASE if $check_word ne $match_word;
                }


                print STDERR "      *** Word Matched '$check_word' *** \n" if DEBUG_HIGHLIGHT;
                $end_pos++;
            }

            print STDERR "      *** PHRASE MATCHED (word:$word_pos offset:$end_pos) *** \n" if DEBUG_HIGHLIGHT;

	    $found_phrase++;


            # We are currently at the end word, so it's easy to set that highlight

            $end_pos--;

            if ( !$end_pos ) { # only one word
                $words[$word_pos * 2] = "$begin$on_flag$word$off_flag$end";
            } else {
                $words[($word_pos + $end_pos) * 2 ] = "$begin$word$off_flag$end";

                #Now, reload first word of match
                $words[$word_pos * 2] =~ /$extract_regexp/ or die "2 Why didn't '$words[$word_pos]' =~ /$extract_regexp/?";
                # Strip ignorefirst and ignorelast
                ( $begin, $word, $end ) = ( $1, $2, $3 );  # probably should cache this!
                $words[$word_pos * 2] = "$begin$on_flag$word$end";
            }


            # Now, flag the words around to be shown
	    my ($start, $stop);
	    if (!$show_all_words) {
		$start = ($word_pos - $Show_Words + 1) * 2;
		$stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
		if ( $start < 0 ) {
		    $stop = $stop - $start;
		    $start = 0;
		}

		$stop = $#words if $stop > $#words;

		$show_all_words = 1 if ($start == 0 && $stop == $#words);
		if (!$show_all_words) { $flags[$_]++ for $start .. $stop; }
	    }


            # All done, and mark where to stop looking
            if ( --$occurrences <= 0 ) {
                $last = $stop;
                last WORD;
            }


            # Now reset $word_pos to word following
            $word_pos += $end_pos; # continue will still be executed
            next WORD;
        }
    } continue {
        $word_pos ++;
    }




    my $dotdotdot = ' ... ';


    my @output;

    my $printing;
    my $first = 1;
    my $some_printed;

    if ( $Show_Words && @words > 50 ) {  # don't limit context if a small number of words
        for my $i ( 0 ..$#words ) {


            if ( $last && $i >= $last && $i < $#words ) {
                push @output, $dotdotdot;
                last;
            }

            if ( $show_all_words || $flags[$i] ) {

                push @output, $dotdotdot if !$printing++ && !$first;
                push @output, $words[$i];
                $some_printed++;

            } else {
                $printing = 0;
            }

	    $first = 0;


        }
    }

    if ( !$some_printed ) {
        for my $i ( 0 .. $Max_Words ) {
            if ( $i > $#words ) {
                $printing++;
                last;
            }
            push @output, $words[$i];
        }
    }



    push @output, $dotdotdot if !$printing;

    $$text_ref = join '', @output;
    my %entities = (
        '&' => '&amp;',
        '>' => '&gt;',
        '<' => '&lt;',
        '"' => '&quot;',
    );
    my %highlight = (
        $on_flag => $On,
        $off_flag => $Off,
    );


    $$text_ref =~ s/([&"<>])/$entities{$1}/ge;  # " fix emacs

    $$text_ref =~ s/($on_flag|$off_flag)/$highlight{$1}/ge;

    return 1;  # Means that prop was processed AND was html escaped.
    return $found_phrase;

    # $$text_ref = join '', @words;  # interesting that this seems reasonably faster



}

#============================================
# Returns compiled regular expressions for matching
#
#

sub set_match_regexp {
    my $self = shift;



    my $wc = $self->header('wordcharacters');
    my $ignoref = $self->header('ignorefirstchar');
    my $ignorel = $self->header('ignorelastchar');


    $wc = quotemeta $wc;

    #Convert query into regular expressions


    for ( $ignoref, $ignorel ) {
        if ( $_ ) {
            $_ = quotemeta;
            $_ = "([$_]*)";
        } else {
            $_ = '()';
        }
    }


    $wc .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';  # Warning: dependent on tolower used while indexing


    # Now, wait a minute.  Look at this more, as I'd hope that making a
    # qr// go out of scope would release the compiled pattern.

    if ( $ENV{MOD_PERL} ) {
        $self->{wc_regexp}      = qr/([^$wc]+)/;                     # regexp for splitting into swish-words
        $self->{extract_regexp} = qr/^$ignoref([$wc]+?)$ignorel$/i;  # regexp for extracting out the words to compare

     } else {
        $self->{wc_regexp}      = qr/([^$wc]+)/o;                    # regexp for splitting into swish-words
        $self->{extract_regexp} = qr/^$ignoref([$wc]+?)$ignorel$/oi;  # regexp for extracting out the words to compare
     }
}

1;




--------------070605000707000702040506--
Received on Wed Jul 7 12:01:25 2004