stemming and swish-2.0-beta1

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Tue Jun 27 2000 - 15:32:08 GMT
Hi all,

Here are some very interesting comments from Bill Moseley.
Any further comments would be appreciated.

> I had a patch in swish to stem wild card words before expanding with
> expandstar.  (It was the other way around in older versions.)  It looks
> like you have reorganized the way that works, and wild card searches are
> not working as they were in my patched version.
> 
> For example, with stemming enabled, searching for "run" "runs" or "running"
> should all stem to "run" and find the same results, and they do.
> 
> It is debatable what should happen when mixing wild cards and stemming.
> For example, searching for "runn*" won't find "running" because "running"
> is stored in the index as "run" and doesn't match "runn".
> 

You are totally right. 
swish-e-1.x works as follows:
1- expandstar translates run* into "runa or runb or run ..."
2- for each word:
   . stem word
   . get word results
3- "or" of all results
4- show results

Step 3 can be terrible if you search for just "r*". The problem is
performance: the expansion produces a very long word list and every
word needs its own lookup, so the response is slow.
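
To make the cost concrete, here is a minimal sketch of the 1.x flow in C. It is only an illustration, not the swish-e source: toy_index, lookup_word and search_expand_then_lookup are made-up names, the index is a small in-memory array, and the stemming step is left out to keep it short.

```c
#include <string.h>

/* Hypothetical toy index; the real index lives in a file. */
static const char *toy_index[] = { "rabbit", "road", "run", "runner", "rust" };
#define TOY_INDEX_SIZE (sizeof toy_index / sizeof toy_index[0])

/* Counts how many separate index lookups a search performs. */
static int lookup_count = 0;

/* Stand-in for one index lookup: returns 1 if the word is present. */
static int lookup_word(const char *word)
{
    size_t i;
    lookup_count++;
    for (i = 0; i < TOY_INDEX_SIZE; i++)
        if (strcmp(toy_index[i], word) == 0)
            return 1;
    return 0;
}

/* 1.x-style search: expand the prefix into every matching word,
 * then do one lookup per expanded word and "or" the results
 * (here simply summed as a hit count). */
static int search_expand_then_lookup(const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t i;
    int hits = 0;
    for (i = 0; i < TOY_INDEX_SIZE; i++)
        if (strncmp(toy_index[i], prefix, plen) == 0)
            hits += lookup_word(toy_index[i]);
    return hits;
}
```

With this toy data, searching for "r" expands to all five words and so costs five lookups, while "run" costs two. On a real index "r*" can expand to thousands of words.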
swish-e-2.0 works as follows:
1- stem run* -> I need to check this. It probably does not work with
the '*'
2- get results for run* in just one call to the file index
3- show results
As you can see, this is a much more efficient process. Just try
searching for r* and you will see the difference.
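
As a rough illustration of why one call is cheaper, here is a sketch in C of a prefix search done in a single pass. It assumes the index words are kept in sorted order, which is an assumption made for this example and not a statement about the real index file format; sorted_index, lower_bound and prefix_scan are hypothetical names.

```c
#include <string.h>

/* Hypothetical index, assumed to be sorted in strcmp() order. */
static const char *sorted_index[] = { "rabbit", "road", "run", "runner", "rust" };
#define SORTED_INDEX_SIZE (sizeof sorted_index / sizeof sorted_index[0])

/* Binary search for the position of the first word >= key. */
static size_t lower_bound(const char *key)
{
    size_t lo = 0, hi = SORTED_INDEX_SIZE;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(sorted_index[mid], key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* One pass: jump to the first candidate, then walk forward while
 * the words still start with the prefix. Returns the match count. */
static size_t prefix_scan(const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t n = 0;
    size_t i;
    for (i = lower_bound(prefix);
         i < SORTED_INDEX_SIZE && strncmp(sorted_index[i], prefix, plen) == 0;
         i++)
        n++;
    return n;
}
```

Here prefix_scan("r") walks the five matching entries in one go, with no per-word lookup in between.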

> I guess I would argue, though, that searching for "running" and searching
> for "running*" (or "runs" and "runs*") should return the same results.  So
> that's why I had the patch to stem words before expanding with expandstar.
> So searching for "running*" would get stemmed to search for "run*" which
> would find all the "run" words in the index, as expected.
> 
> Would it be difficult to make the new version also stem before expanding
> the wild card search?
> 
It is very easy. I can add something like:
if (applyStemmingRules || applySoundexRules)
{
  searchwordlist=expandstar(searchwordlist, fp);
}
but this would bring back the bad performance (only when stemming
is enabled).
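
For reference, the behaviour Bill describes ("running*" becoming "run*") can be sketched like this in C. Note that this stems the term and keeps the wildcard, which is a different approach from the expandstar call above. toy_stem is a deliberately tiny stand-in that only handles a trailing "ing" or "s"; it is not the stemmer swish-e uses, and stem_wildcard_term is a hypothetical helper name.

```c
#include <string.h>

/* Toy stemmer for illustration only: strips "ing" (undoubling a
 * doubled final consonant) or a single trailing "s". */
static void toy_stem(char *w)
{
    size_t n = strlen(w);
    if (n > 4 && strcmp(w + n - 3, "ing") == 0) {
        n -= 3;
        if (n >= 2 && w[n - 1] == w[n - 2])
            n--;                      /* "runn" -> "run" */
        w[n] = '\0';
    } else if (n > 2 && w[n - 1] == 's' && w[n - 2] != 's') {
        w[n - 1] = '\0';              /* "runs" -> "run" */
    }
}

/* Stem a search term, keeping a trailing '*' if there is one.
 * The result is written to out, truncated to fit outsz bytes. */
static void stem_wildcard_term(const char *term, char *out, size_t outsz)
{
    size_t n = strlen(term);
    int wild = (n > 0 && term[n - 1] == '*');

    if (outsz < 2) {                  /* no room for anything useful */
        if (outsz)
            out[0] = '\0';
        return;
    }
    if (wild)
        n--;                          /* drop the '*' before stemming */
    if (n > outsz - 2)
        n = outsz - 2;                /* leave room for '*' and NUL */
    memcpy(out, term, n);
    out[n] = '\0';
    toy_stem(out);
    if (wild)
        strcat(out, "*");             /* put the wildcard back */
}
```

With this, "running*" and "runs*" both become "run*", so they match the same index entries as "running" and "runs" do after stemming. A term like "runn*" is left as it is, because the toy stemmer has no rule for it.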


> One other question:  In searching you now split up words by WordCharacters.
>  Just so I understand, do you merge the WordCharacters from each index file
> into one set of characters?  That is, you don't process the search terms
> once per index file, but rather once for the entire search using merged
> WordCharacters and other settings?
> 
Yes, I merge all the WordCharacters and other settings. I do the same
when merging index files.
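
One simple way to picture the merge is as a union of character sets. The sketch below assumes that "merge" means a character counts as a word character if any of the index files lists it. That is an assumption for the example, and charset_t and the charset_* functions are hypothetical names, not swish-e code.

```c
#include <string.h>

/* One flag per possible byte value: nonzero means "word character". */
typedef struct {
    unsigned char is_word[256];
} charset_t;

/* Build a set from a WordCharacters-style string such as "abc". */
static void charset_init(charset_t *cs, const char *chars)
{
    memset(cs->is_word, 0, sizeof cs->is_word);
    for (; *chars; chars++)
        cs->is_word[(unsigned char)*chars] = 1;
}

/* Merge src into dst: the result is the union of both sets. */
static void charset_merge(charset_t *dst, const charset_t *src)
{
    int i;
    for (i = 0; i < 256; i++)
        dst->is_word[i] |= src->is_word[i];
}

/* Returns nonzero if c is a word character in this set. */
static int charset_contains(const charset_t *cs, char c)
{
    return cs->is_word[(unsigned char)c];
}
```

The search terms are then split once, using the merged set, instead of once per index file.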

cu
Jose
Received on Tue Jun 27 11:47:29 2000