Re: SwishFuzzyWordError() and missing stemmer constants

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Jan 29 2007 - 16:47:15 GMT
On Sun, Jan 28, 2007 at 11:49:20PM -0800, Antony Dovgal wrote:
> According to the documentation, SwishFuzzyWordError() return values are 
> defined in src/stemmer.h file, and this is true. 
> However, this actually makes it impossible to use these values, because
> stemmer.h is not a public header and is used only internally.
> 
> Also, it's not really clear whether one should use this function or whether it's not recommended/deprecated/etc.
> The documentation of SwishFuzzyWordError() sheds almost no light:
> 
> "Not all stemmers set this value correctly." - well, this means at least some of them 
> DO return correct values. That's better than nothing.
> Maybe it's time to fix those returning incorrect values?

The stemming code in swish mixes "stemmers" from different sources,
so not all error codes apply to all stemmers.  Grepping the source
shows which files set which codes:

$ fgrep ' STEM_' *.c
soundex.c:        fw->error =  STEM_WORD_TOO_BIG;
soundex.c:            fw->error = STEM_NOT_ALPHA;
soundex.c:            fw->error = STEM_TOO_SMALL;
soundex.c:                       return STEM_OK;  /* Hum, probably not right */
stemmer.c:    fw->error = STEM_OK;                    /* default to OK */
stemmer.c:        fw->error = STEM_TO_NOTHING;


> "But since SwishFuzzyWordList() will return a valid string regardless of the return value, 
> you can often just ignore this setting. That's what I do." - how often should I ignore it? =)
> I mean, if the value of this function should be ignored, then the function itself is useless.

The error code is not important to swish itself -- swish just passes
in words, and if there's a problem (say the word can't be stemmed) it
uses the unstemmed word for indexing and searching.

You might have some need for that error code outside of swish, though
(say to test and flag which words in a query were stemmed).
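
A rough sketch of that idea, in case it helps.  Nothing here is
swish-e code: the enum only mirrors the STEM_* names from the grep
output above (the real values are in stemmer.h), and toy_stem(),
stem_or_fallback() and stemmed_word are hypothetical names made up for
the illustration.  It stems a word, falls back to the unstemmed word on
any error, and keeps the status so the caller can flag it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the STEM_* names seen in the grep output.
   The real definitions and numeric values live in swish-e's stemmer.h. */
typedef enum {
    STEM_OK,
    STEM_NOT_ALPHA,
    STEM_TOO_SMALL,
    STEM_WORD_TOO_BIG,
    STEM_TO_NOTHING
} stem_status;

#define MAX_WORD 64

/* One query word: the term to search with, plus how stemming went. */
typedef struct {
    char term[MAX_WORD];
    stem_status status;
} stemmed_word;

/* Toy stemmer (NOT swish-e's): strips a single trailing 's'. */
stem_status toy_stem(const char *word, char *out, size_t out_size)
{
    size_t len = strlen(word);
    size_t i;

    if (len + 1 > out_size)
        return STEM_WORD_TOO_BIG;
    if (len < 3)
        return STEM_TOO_SMALL;
    for (i = 0; i < len; i++)
        if (!isalpha((unsigned char)word[i]))
            return STEM_NOT_ALPHA;

    memcpy(out, word, len + 1);
    if (out[len - 1] == 's')
        out[len - 1] = '\0';
    return STEM_OK;
}

/* Stem a word; on any error use the unstemmed word instead, but keep
   the status so the caller can tell which words were really stemmed. */
stemmed_word stem_or_fallback(const char *word)
{
    stemmed_word result;

    result.status = toy_stem(word, result.term, sizeof result.term);
    if (result.status != STEM_OK) {
        /* Truncating copy keeps the fallback safe for oversized input. */
        strncpy(result.term, word, sizeof result.term - 1);
        result.term[sizeof result.term - 1] = '\0';
    }
    return result;
}
```

A caller would run each query word through stem_or_fallback() and
treat status != STEM_OK as "this word was searched as typed".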

It's been a long time since I looked at the Snowball API, but looking
at this bit of code:

    fi->stemmer->lang_stem(snowball); /* Stem the word */

    if ( 0 == snowball->l )
    {
        fw->error = STEM_TO_NOTHING;
        return fw;
    }

Shouldn't the return value of calling lang_stem() be tested?  Or maybe
testing the length is fine.  I'm not sure.
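
For what it's worth, testing both would look roughly like the sketch
below.  It is not the swish or Snowball code: struct toy_env, stem_fn,
run_and_check() and the CHECK_* names are hypothetical, and it ASSUMES
that a negative return from the stem function means the stemmer itself
failed, which I have not verified against the Snowball version bundled
with swish.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the Snowball environment;
   only the length field `l` from the quoted snippet is modelled. */
struct toy_env {
    int l;   /* length of the current (stemmed) word */
};

/* Shape of a stem function: takes the environment, returns a status. */
typedef int (*stem_fn)(struct toy_env *);

typedef enum {
    CHECK_OK,
    CHECK_STEM_FAILED,      /* the stem call itself reported an error */
    CHECK_STEM_TO_NOTHING   /* call succeeded but left an empty word  */
} check_result;

/* Run the stemmer, testing the return value first and the length second.
   ASSUMPTION: a negative return means the stemmer failed. */
check_result run_and_check(stem_fn stem, struct toy_env *env)
{
    if (stem(env) < 0)
        return CHECK_STEM_FAILED;
    if (env->l == 0)
        return CHECK_STEM_TO_NOTHING;
    return CHECK_OK;
}

/* Three toy stemmers, one per branch. Note that stem_fails() leaves a
   non-zero length behind, so a length-only test would not notice it. */
int stem_fine(struct toy_env *env)  { env->l = 4; return 1; }
int stem_empty(struct toy_env *env) { env->l = 0; return 1; }
int stem_fails(struct toy_env *env) { env->l = 7; return -1; }
```

If the stem function can never fail in practice, the length test alone
is equivalent and the extra check costs nothing.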

> Hence the question: 
> Would you accept a patch exporting those constants to public (and changing the 
> function prototype appropriately) or should I forget about SwishFuzzyWordError()?
> See diff against current CVS in attachment.

I think the patch makes sense.  I'm not sure why the STEM_RETURNS
type was not made public.



-- 
Bill Moseley
moseley@hank.org

Received on Mon Jan 29 08:47:20 2007