Re: stemming and swish-2.0-beta1

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Wed Jun 28 2000 - 09:27:29 GMT

Hi Bill,

This is just a temporary fix for stemming in wildcard searches. Let me know
if it works.

Change the following lines in function operate in search.c (line number 1048):

        if (applyStemmingRules)
        {
                /* apply stemming algorithm to the search term */
                Stem(word, MAXWORDLEN); /* CAREFUL! word length is assumed */
        }

to

        if (applyStemmingRules)
        {
                /* apply stemming algorithm to the search term */
                i=strlen(word);
                /* i>1: skip empty words and a lone "*" */
                if(i>1 && word[i-1]=='*') {
                        word[--i]='\0';   /* trim the star; i is now its index */
                } else i=0;   /* No star */
                Stem(word, MAXWORDLEN); /* CAREFUL! word length is assumed */
                if(i) strcat(word,"*"); /* restore the star - Need to check
                                           length of the string? */
        }

If you search for "word*", it trims the "*", then stems "word" and restores
the "*".
So for "running*"
running* --> running --Stem(running)-->run-->run*

for "runs*"
runs* --> runs --Stem(runs)-->run-->run*

and for "runn*"
runn* --> runn --Stem(runn)-->runn-->runn*

This should keep performance good, because the stemmed term is still searched
as a single wildcard word.
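
The trim/stem/restore steps can be sketched as a standalone function. This is
only an illustration under assumptions: toy_stem() is a hypothetical stand-in
for the real Stem() (it handles just enough suffixes to reproduce the three
examples above), and stem_wildcard() and its size parameter are made-up names.
It also shows one possible answer to the "need to check length?" question, by
refusing to append the star when it would not fit in the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for swish's Stem(); NOT the real algorithm.
   Strips "ing" (then undoubles a final doubled letter) or a final "s". */
static void toy_stem(char *word)
{
    size_t n = strlen(word);

    if (n > 4 && strcmp(word + n - 3, "ing") == 0) {
        n -= 3;
        word[n] = '\0';
        if (n >= 2 && word[n - 1] == word[n - 2])
            word[--n] = '\0';          /* "runn" -> "run" */
    } else if (n > 2 && word[n - 1] == 's') {
        word[n - 1] = '\0';            /* "runs" -> "run" */
    }
}

/* Stems word in place while preserving a trailing '*' wildcard.
   size is the capacity of the buffer, including the terminator.
   Returns 0 on success, -1 if the star would not fit back in. */
int stem_wildcard(char *word, size_t size)
{
    size_t len;
    int had_star;

    assert(word != NULL);
    len = strlen(word);

    /* len > 1 leaves an empty word and a lone "*" untouched */
    had_star = (len > 1 && word[len - 1] == '*');
    if (had_star)
        word[len - 1] = '\0';          /* trim the star */

    toy_stem(word);

    if (had_star) {
        len = strlen(word);
        if (len + 2 > size)            /* room for '*' and '\0'? */
            return -1;
        word[len] = '*';               /* restore the star */
        word[len + 1] = '\0';
    }
    return 0;
}
```

With a buffer holding "running*", stem_wildcard() leaves "run*" in it; a plain
"running" becomes "run", and "runn*" comes back unchanged.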

BTW, wildcards are treated like normal words. The only difference is in
function getfileinfo:
- A search for a normal word (no wildcard) uses a fast hash approach.
- A search for a wildcard word uses a sequential approach (words are sorted
  in the index file). So it returns all the data for all the matching words
  at once, without using an "or" function. For this reason the performance
  is better.
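
To illustrate why a sequential pass over sorted words can collect every
wildcard match in one go, here is a minimal sketch. It is not the code in
getfileinfo: the words[] array, prefix_matches() and the early exit are all
assumptions made for the example, and the hash lookup used for exact words is
not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory index: words kept in sorted order,
   standing in for the sorted word list of the index file. */
static const char *const words[] = {
    "roam", "run", "runner", "running", "runs", "rust"
};
#define NWORDS (sizeof words / sizeof words[0])

/* Counts the words that start with prefix (the search term without
   its '*'). If first is not NULL, it receives the first match.
   Because the list is sorted, all matches are adjacent, so the scan
   can stop as soon as it passes them. */
size_t prefix_matches(const char *prefix, const char **first)
{
    size_t plen, i, count = 0;

    assert(prefix != NULL);
    plen = strlen(prefix);

    for (i = 0; i < NWORDS; i++) {
        int cmp = strncmp(words[i], prefix, plen);

        if (cmp == 0) {
            if (count == 0 && first != NULL)
                *first = words[i];
            count++;                   /* one more match in the run */
        } else if (cmp > 0) {
            break;                     /* sorted: nothing later matches */
        }
    }
    return count;
}
```

For the term "run*", prefix_matches("run", &first) visits the adjacent entries
"run", "runner", "running" and "runs" in a single pass and stops at "rust", so
no separate per-word lookups need to be merged afterwards.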

Another thing: I think the same fix must be applied to soundex, right?

Waiting to hear from you.

cu
Jose
Received on Wed Jun 28 05:49:21 2000