Re: stemming and swish-2.0-beta1

From: Bill Moseley <moseley(at)>
Date: Tue Jun 27 2000 - 19:23:05 GMT

At 08:32 AM 06/27/00 -0700, Jose Manuel Ruiz wrote:
>swish-e-2.0 works as follows:

>1- stem in run* -> need to check it. Probably it will not work with the
>2- get results for run* in just one call to file index
>3- show results
>As you can see, this is a much more efficient process. Just try to search
>for r* and you will experience it.


>It is very easy. I can add something like:
>if (applyStemmingRules || applySoundexRules)
>  searchwordlist=expandstar(searchwordlist, fp);
>but this will give you bad performance once again (only when Stemming

I'm not sure I fully understand the difference in wildcard processing with
2.0 vs. pre 2.0.

In pre-processing queries before sending to swish I used to take each word,
and if it had an ending "*" then I would stem the word and stick the "*"
back on.

That is, a query for:

        swish-e -w 'word runs*'
        swish-e -w 'word running*'

would get turned into this query after stemming "runs" or "running":

        swish-e -w 'word run*'

If I didn't do that then those searches would fail.  

Swish used to expand wild cards first, then apply stemming to all the words
it found in the index -- clearly an error in logic, as the words in the index
were already stemmed.  To make things worse, stemmer.c might take an
already stemmed word and stem it further.  So, in effect, swish would look up
a word in its index, then search the index again with the re-stemmed word and
not find it!

Again, I'm not clear on how 2.0 does the wildcard searches, but could swish
just stem wildcard terms before searching and still retain the speed gains
of 2.0?
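
To make the question concrete, here is a rough sketch in C of what "stem the
wildcard term first, then do one prefix lookup" might look like.  Everything
in it is an assumption for illustration: toy_stem() is a hypothetical
stand-in that strips a couple of suffixes (it is not the algorithm in
stemmer.c), and the index is just an array of already-stemmed words, not
swish's real index format.

```c
#include <string.h>

/* Hypothetical stand-in for the real stemmer: strips a few common
 * suffixes in place.  NOT the algorithm used by stemmer.c. */
static void toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 5 && strcmp(word + len - 4, "ning") == 0)
        word[len - 4] = '\0';               /* running -> run */
    else if (len > 4 && strcmp(word + len - 3, "ing") == 0)
        word[len - 3] = '\0';               /* walking -> walk */
    else if (len > 3 && word[len - 1] == 's' && word[len - 2] != 's')
        word[len - 1] = '\0';               /* runs -> run */
}

/* If term ends in '*', stem the part before it and put the '*' back.
 * Works in place; returns 1 for a wildcard term, 0 otherwise. */
static int stem_wildcard_term(char *term)
{
    size_t len = strlen(term);

    if (len < 2 || term[len - 1] != '*')
        return 0;

    term[len - 1] = '\0';                   /* drop the '*' */
    toy_stem(term);                         /* result is never longer */
    strcat(term, "*");                      /* so the '*' still fits */
    return 1;
}

/* Count the (already stemmed) index words that match a wildcard term
 * such as "run*" by comparing only the prefix before the '*'. */
static int count_prefix_matches(const char *term,
                                const char *const *index, size_t n)
{
    size_t len = strlen(term);
    size_t plen;
    size_t i;
    int hits = 0;

    if (len == 0 || term[len - 1] != '*')
        return 0;
    plen = len - 1;                         /* prefix length without '*' */

    for (i = 0; i < n; i++)
        if (strncmp(index[i], term, plen) == 0)
            hits++;
    return hits;
}
```

With a made-up index of { "run", "runner", "rust", "word" }, the raw term
"running*" matches nothing, while the same term passed through
stem_wildcard_term() becomes "run*" and matches "run" and "runner".  The
stemming happens once per query term rather than once per expanded word, so
it should not cost the single-pass lookup much.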

Here's the old query pre-processing code, in Perl, that I used before
patching swish to stem before expanding wild cards.  I'm not 100% sure it
matches the process swish uses during indexing, but oh well.

    return grep {
        /^[$BeginCharacters"]/o &&   # Words must begin with these
        /[$EndCharacters*"]$/o       # and end with these ('*' allowed)

    } map {
        s/^[$IgnoreFirstChar]+//o;   # Remove leading chars
        s/[$IgnoreLastChar]+(?=\*?$)//o;    # Remove trailing chars

        ( m/(.+)\*$/ && $Stemming )  # stem wild cards since Swish won't
        ? (stem_words($1))[0] . '*'
        : $_;

    } split /[^$WordCharacters*"]/o, lc $words;
                                    # Words first get defined here

Bill Moseley
Received on Tue Jun 27 15:38:14 2000