RE: Failing to find a word

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Oct 07 1999 - 21:47:00 GMT
At 09:25 PM 10/6/99 -0500, David Norris wrote:
>You could add a printf statement to see how
>the 'word' array is being transformed in the Stem() function in
>stemmer.c.

I now need a bit more help -- my C skills are very weak, and I don't follow
search.c and index.c all that well.  (Thanks to David, though, I can now
build swish on my PC and try a few things.)

Here's the problem, from my poor reading of search.c:

The word 'database' is in a file to be indexed.  With stemming enabled,
Swish stems the word to 'databas' and places that word in the index (see my
previous post for -D output).

Now, when searching for 'data*', expandstar() in search.c grabs all words
out of the index that start with 'data'.  In this case it finds only
'databas' and uses that as the search word.  Since stemming is enabled,
Swish, rightly so, stems the search words.  But in this case 'databas' stems
further into 'databa', which, of course, is NOT in the index.
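
To make the double-stemming failure concrete, here is a minimal sketch.  It
assumes a toy stemmer that strips one trailing 'e' or 's' per call; toy_stem()
and in_index() are hypothetical stand-ins, not the real Stem() from stemmer.c
or Swish's index lookup.

```c
#include <string.h>

/* Toy stand-in for Stem(): removes at most one trailing 'e' or 's'
 * per call.  NOT the real stemmer; it only mimics the behaviour
 * described above (database -> databas -> databa). */
void toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 3 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Hypothetical index that holds the single stemmed word 'databas'. */
int in_index(const char *word)
{
    return strcmp(word, "databas") == 0;
}
```

Calling toy_stem() once on 'database' gives the form that was indexed;
calling it a second time on that result gives a word in_index() rejects,
which is the same shape as the failure with 'data*'.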

It's hard to know where the error is, and what should be fixed.

Stem() could be modified to continue stemming until a word will not stem.
But, in my opinion, search.c is really where there is a problem with the
program's logic.  
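
For reference, "continue stemming until a word will not stem" could look
roughly like the sketch below.  It assumes a toy stemmer that strips one
trailing 'e' or 's' per call; toy_stem() and stem_fully() are hypothetical
names, and the real Stem() may behave differently.

```c
#include <string.h>

/* Toy stand-in for Stem(): removes at most one trailing 'e' or 's'. */
void toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 3 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Call the stemmer repeatedly until the word stops changing.
 * Returns the number of passes that changed the word. */
int stem_fully(char *word)
{
    char prev[64];
    int passes = 0;

    for (;;) {
        if (strlen(word) >= sizeof prev)
            return passes;          /* too long to compare safely */
        strcpy(prev, word);
        toy_stem(word);
        if (strcmp(prev, word) == 0)
            break;                  /* fixed point reached */
        passes++;
    }
    return passes;
}
```

If the indexer stored the fully stemmed form, stemming a word pulled back out
of the index would leave it unchanged, so the second stemming pass in
search.c would do no harm.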

The words entered in the query should be stemmed, before the expandstar()
routine, not after.  And not just because of this double-stemming problem.

For example, consider a source file with the word 'runs', which Swish stems
and places in the index as 'run'.  Searches for 'running', 'runs', and 'r*'
all work.

But...

E:\swish\perl\x>swish -w runs*  
# Search words: runs*
 Stemming thisisnotaword
 Stemmed thisisnotaword
err: no results

What's happening here is expandstar(), and thus getmatchword(), is trying
to find all the words that begin with 'runs' in the index to use in the
expanded search query.  But 'runs' isn't in the index; only its stem 'run'
is.  So the lookup fails.

So, modifying search.c to stem the query words before expanding is the best
solution.  It also means that Stem() is called less often when expandstar()
generates a large list of words to match against.  (Why pull a bunch of
stemmed words out of the index, and then stem them once again?)
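
A rough sketch of the stem-then-expand order follows.  Everything in it is an
assumption for illustration: toy_index is a hypothetical in-memory index of
already-stemmed words, toy_stem() is a toy stemmer that strips one trailing
'e' or 's', and count_prefix_matches() only plays the role of expandstar().
None of this is the actual search.c code.

```c
#include <string.h>

/* Hypothetical index; the words are stored already stemmed. */
const char *toy_index[] = { "databas", "run" };
#define TOY_INDEX_LEN (sizeof toy_index / sizeof toy_index[0])

/* Toy stand-in for Stem(): removes at most one trailing 'e' or 's'. */
void toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 3 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Plays the role of expandstar(): count index words with this prefix. */
size_t count_prefix_matches(const char *prefix)
{
    size_t n = 0, plen = strlen(prefix);

    for (size_t i = 0; i < TOY_INDEX_LEN; i++)
        if (strncmp(toy_index[i], prefix, plen) == 0)
            n++;
    return n;
}

/* Stem the typed word first, then expand; expanded words are used
 * exactly as found in the index and are never stemmed again. */
size_t search_stem_first(const char *query)
{
    char word[64];
    size_t len;

    strncpy(word, query, sizeof word - 1);
    word[sizeof word - 1] = '\0';
    len = strlen(word);

    if (len > 0 && word[len - 1] == '*') {
        word[len - 1] = '\0';               /* drop the wildcard */
        toy_stem(word);                     /* stem the typed prefix once */
        return count_prefix_matches(word);  /* no second stemming pass */
    }

    toy_stem(word);                         /* plain word: exact lookup */
    for (size_t i = 0; i < TOY_INDEX_LEN; i++)
        if (strcmp(toy_index[i], word) == 0)
            return 1;
    return 0;
}
```

With this ordering, search_stem_first("runs*") stems the prefix to 'run'
before the index is consulted, and search_stem_first("data*") uses 'databas'
as found, so both cases described above find a match in the toy index.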

It would be nice to fix Stem(), too, not so much for its failure to stem a
word completely (which probably doesn't matter), but to keep Stem() from
stemming words into nonexistence and thus leaving them out of the index.
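
One possible guard is sketched below.  It assumes "nonexistence" means the
stemmer returns a word that is empty or shorter than some minimum length;
MIN_STEM_LEN, toy_stem() and guarded_stem() are hypothetical names, and the
toy stemmer here is deliberately aggressive so the problem shows up.

```c
#include <string.h>

#define MIN_STEM_LEN 2   /* assumed threshold, chosen only for illustration */

/* Deliberately aggressive toy stemmer: strips one trailing 'e' or 's'
 * regardless of how short the word already is. */
void toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 0 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Stem in place, but restore the original word if the result would be
 * shorter than MIN_STEM_LEN.  Returns 1 if the stemmer's output was
 * kept (changed or not), 0 if the original word was left or restored. */
int guarded_stem(char *word)
{
    char saved[64];

    if (strlen(word) >= sizeof saved)
        return 0;                /* too long to back up; leave untouched */
    strcpy(saved, word);
    toy_stem(word);
    if (strlen(word) < MIN_STEM_LEN) {
        strcpy(word, saved);     /* stem vanished; keep the original */
        return 0;
    }
    return 1;
}
```

The idea is only that the indexer falls back to the unstemmed word rather
than dropping it, so the word still ends up in the index.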

As my C skills are lacking, can anyone help with or recommend some code
changes?

Thanks,



Bill Moseley
mailto:moseley@hank.org
Received on Thu Oct 7 15:13:43 1999