Re: Stopwords/Stemming

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jun 05 2005 - 16:06:23 GMT
On Fri, Jun 03, 2005 at 12:39:54PM -0700, Brad Miele wrote:
> So it seems like it is stemming the word and then comparing it against the
> stopwords. Does this seem like a correct assessment?

Yes, the logic is wrong.  Stopwords are removed after applying
stemming when searching, but before when indexing.

When searching the code goes something like this:

    parse_swish_query()
        tokenize_query_string()
            tokenize by white space and operator characters
            lower case words
            check for buzzwords
            parse_swish_words() -- convert into swish word:
                apply TranslateChars
                tokenize again based on wordcharacters/begin/endchars.
                (where stopwords were removed before)
                limit by max word size
                apply fuzzy translation
            remove stopwords

So, yes, the IgnoreWords list is applied after stemming when
searching.  I think it can be debated which should be done first --
fuzzy translation or stopword removal.  For stemming, it seems like you
might want to remove stopwords afterward ("IgnoreWords run" should then
remove all forms, e.g. runs and running), but for things like soundex
you would want to remove them beforehand (you don't want to enter
soundex codes into your stopword list).
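
To make the ordering question concrete, here is a rough sketch in Python.
It assumes a toy suffix-stripping stemmer, not Swish-e's real stemmer, and
all function names are hypothetical. It only shows how the same stopword
list behaves under the two orderings.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; strips a few suffixes only."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stop_then_stem(words, stopwords):
    # Stopwords are compared against the raw words, then survivors are stemmed.
    return [toy_stem(w) for w in words if w not in stopwords]


def stem_then_stop(words, stopwords):
    # Words are stemmed first, then the stems are compared against stopwords.
    stemmed = [toy_stem(w) for w in words]
    return [s for s in stemmed if s not in stopwords]


words = ["run", "runs", "running", "walk"]
stopwords = {"run"}

print(stop_then_stem(words, stopwords))  # ['run', 'run', 'walk']
print(stem_then_stop(words, stopwords))  # ['walk']
```

With stemming, checking stopwords afterward catches every form of "run"
from a single entry.  With something like soundex the stems would be
codes, so the stopword list would have to hold codes too, which is the
argument for checking beforehand in that case.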

This is back to the issue of the query parser needing a rewrite.
It might be possible to just move the stopword check back to where
stopwords were removed before, but there are some notes in the source
about why that's not done, so I'd need to check up on that first.


Indexing goes something like this:

    indexstring()
        next_word()
            tokenize by white space
            lower case word
            check for buzzwords
            next_swish_word()
                tokenize into "swish words" based on Wordcharacters, etc.
            make sure word starts/ends with begin/endchars.
            limit by
                stopwords
                word length
                consecutive digits
                consecutive vowels
                consecutive consonants
            apply fuzzy translation
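
The mismatch between the two outlines can be sketched in Python as
follows.  This is only an illustration under stated assumptions: the
stemmer is a toy, the function names are hypothetical, and every step
other than stopword removal and fuzzy translation is left out.

```python
STOPWORDS = {"run"}  # as if the config had "IgnoreWords run"


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; strips a few suffixes only."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    # Indexing order from the outline: stopwords first, then fuzzy translation.
    words = text.lower().split()
    kept = [w for w in words if w not in STOPWORDS]
    return [toy_stem(w) for w in kept]


def query_terms(query):
    # Searching order from the outline: fuzzy translation first, then stopwords.
    stemmed = [toy_stem(w) for w in query.lower().split()]
    return [s for s in stemmed if s not in STOPWORDS]


print(index_terms("Running shoes"))  # ['run', 'shoe']
print(query_terms("running"))        # []
```

Here "running" is not itself a stopword, so its stem "run" ends up in
the index, but the same word in a query is stemmed to "run" and then
dropped, so the query can never match it.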


-- 
Bill Moseley
moseley@hank.org

Received on Sun Jun 5 09:06:24 2005