Re: RE: stemming

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Nov 19 1999 - 06:15:24 GMT
At 08:41 PM 11/18/99 -0800, you wrote:
>> search for "rockies", but do contain "rock"?
>
>This is probably correct.  The stemmer simply strips the suffix from a
>word.  "Rockies" is "Rocki".  "Supplies" is "Suppli"
>
>> if I search for "rock" instead of "rockies",
>> I get hits I did not get the first time.
>
>This is possibly a bug.  The stemmer will fail to match some words
>when it really should.  I do not know a proper way around the problem.

No, that's how it works.

rockies stems to rocki.  So searching for rockies will find documents with
rockies in it (and anything else that stems to rocki).

rock, rocks, rocking, rockings, rocked all stem to rock.  So searching for
rock will find documents with any of those words, but not rockies.
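
To make the behaviour above concrete, here is a minimal sketch.  The
toy_stem function is a made-up suffix stripper, not Swish's actual
stemming code, and the two documents are hypothetical.  It only aims to
reproduce the stems mentioned in this message.

```python
# Toy suffix stripper for illustration only; NOT Swish's real stemmer.
def toy_stem(word):
    word = word.lower()
    # Plural step: "ies" -> "i", otherwise drop a trailing "s" (but not "ss").
    if word.endswith("ies"):
        word = word[:-3] + "i"
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Drop "ing" or "ed" when what is left still contains a vowel.
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and any(c in "aeiou" for c in stem):
            word = stem
            break
    return word

# Hypothetical documents; the index maps each stem to the documents using it.
docs = {
    "doc1": "rockies",
    "doc2": "rock rocks rocking rockings rocked",
}
index = {}
for name, text in docs.items():
    for w in text.split():
        index.setdefault(toy_stem(w), set()).add(name)

def search(word):
    # The query word is stemmed the same way before the lookup.
    return sorted(index.get(toy_stem(word), set()))

print(search("rockies"))  # ['doc1']
print(search("rock"))     # ['doc2']
```

The two searches land on different stems (rocki and rock), which is why
neither query returns the other document.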

The original poster might check whether the returned documents contain
other words that stem to rocki.  I can't think of any right now.  I can't
get Swish to find rock when searching for rockies.  I tried it on a freshly
downloaded and compiled SWISH 1.3.2.

>If you want to more easily debug the stemmer, or another word filter,
>then I have written a wrapper (extract it to your swish-e directory):
>http://www.webaugur.com/wares/files/wordtest.tar.gz

Also, don't forget about the -D option to look at how swish is actually
indexing (and stemming) words.

You can find out more in the Swish list archive, but you should know that
wild card searches don't work as expected with stemming.  Swish stores just
the stem of indexed words in the index.  So searching for rocking* won't
find rocking or rockings: the asterisk keeps 'rocking*' from being stemmed,
and the index holds only the stem rock.
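
A rough sketch of that failure, assuming the index holds only the two
stems discussed here and that a wild card is matched as a plain prefix
against those stems.  The function name and the index contents are
invented for illustration.

```python
# Hypothetical index: only stems are stored, never the original words.
index_stems = {"rock", "rocki"}

def wildcard_matches(query, stems):
    # The trailing "*" keeps the term from being stemmed, so the literal
    # prefix is compared against the stored stems.
    prefix = query.rstrip("*")
    return sorted(s for s in stems if s.startswith(prefix))

print(wildcard_matches("rocking*", index_stems))  # []
print(wildcard_matches("rock*", index_stems))     # ['rock', 'rocki']
```

No stored stem starts with "rocking", so the first query comes back empty
even though documents containing rocking were indexed.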

Another problem is in how Swish expands wild cards.  If you search for
'ro*' Swish expands the wild card to all words in the index that start with
ro.  These words from the index are used to build the expanded search
string.  So, searching for "ro*" gets changed into a search that looks like
"(rocki or rock)" (which includes rocking, rocks, rockies).  This new
expanded search string is used to generate the results.  

Since stemming is in use, Swish stems the search words before finally
searching the index for the results.  So Swish ends up stemming the
expanded search words twice -- once during indexing, and once again during
searching.  Unfortunately, the stemming routine doesn't fully stem some
words, so feeding previously stemmed words to the stemmer results in words
that aren't in the index.  In other words, Swish takes perfectly good words
out of the index, stems them, and then fails to find them in the index again.
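
The double stemming can be sketched as follows.  This uses a deliberately
crude, made-up stemmer (not Swish's) that strips one suffix per pass, and
the word "needed" is a hypothetical example picked because this toy
stemmer handles it badly.  Which words trip up the real stemmer is not
something I have listed here.

```python
def toy_stem(word):
    # One pass only: strip a single "ing" or "ed" if a vowel remains.
    # Crude on purpose, so stemming a stem can change it again.
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and any(c in "aeiou" for c in stem):
            return stem
    return word

# Indexing: only the stems are stored, here "need" and "rock".
index = {toy_stem(w) for w in ["needed", "rocking"]}

def wildcard_search(term):
    prefix = term.rstrip("*")
    # Step 1: expand the wild card to words taken from the index.
    expanded = sorted(s for s in index if s.startswith(prefix))
    # Step 2: the expanded words get stemmed a second time.
    restemmed = [toy_stem(s) for s in expanded]
    # Step 3: look the re-stemmed words up in the index.
    found = [s for s in restemmed if s in index]
    return expanded, restemmed, found

print(wildcard_search("ne*"))  # (['need'], ['ne'], [])
print(wildcard_search("ro*"))  # (['rock'], ['rock'], ['rock'])
```

In the first search the word "need" comes straight out of the index, gets
stemmed again to "ne", and is then not found.  The second search works
only because "rock" happens to survive a second pass unchanged.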



Bill Moseley
mailto:moseley@hank.org
Received on Thu Nov 18 22:30:43 1999