Re: fix for my stemmer_en2 issue

From: Peter Karman <peter(at)>
Date: Sat Nov 11 2006 - 21:51:21 GMT
Bill Moseley scribbled on 11/11/06 9:47 AM:
> On Fri, Nov 10, 2006 at 09:24:12PM -0800, Peter Karman wrote:
>> The difference when I put them back in however was that instead of being 
>> dropped from stemmer.h at the same time.
>> To make matters more confusing, the error message indicates that the deprecated 
>> features Stemming_en and Stem will use Stemmer_en1 -- but they are marked with 
>> FUZZY_STEMMING_EN2 even though they call the same init/free functions as 
>> Stemmer_en1.
> Oh, that's not good.
>> So, there's definitely something suspicious in stemmer.c I think. I'm going to 
>> commit a change to CVS -- Brad, would you take a look at the CVS version and see 
>> if that works any better?
> This will require re-indexing.  That table maps the configuration
> names to an index number used to indicate the stemmer -- and that
> number is stored in the index to know what stemmer to use when
> searching.

Yes, I got to thinking about this some more last night after I checked in that 
change to stemmer.c. I think there was also a problem in stemmer.h, since I had 
removed FUZZY_STEMMING_EN altogether, which basically meant there was an 
off-by-1 difference between 2.4.3 and 2.4.4 indexes with respect to the stemmer. 
It would only manifest if you used stemming (which I don't) and searched a 2.4.3 
index using the 2.4.4 library.
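
To illustrate the off-by-one, here is a minimal sketch in C. The enum names and 
values are made up for the example and are not the real contents of stemmer.h; 
the only assumption carried over is that the number stored in the index is the 
enum value, so deleting one enumerator shifts every later one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Numbers as the older release wrote them into the index (illustrative). */
enum old_mode {
    OLD_NONE,          /* 0 */
    OLD_STEMMING_EN,   /* 1, deprecated */
    OLD_STEMMING_EN1,  /* 2 */
    OLD_STEMMING_EN2   /* 3 */
};

/* Deprecated value deleted: every later value slides down by one. */
enum broken_mode {
    BROKEN_NONE,          /* 0 */
    BROKEN_STEMMING_EN1,  /* 1 */
    BROKEN_STEMMING_EN2   /* 2 */
};

/* Deprecated value kept as a placeholder: numbering matches the old release. */
enum fixed_mode {
    FIXED_NONE,          /* 0 */
    FIXED_STEMMING_EN,   /* 1, kept only so later values do not move */
    FIXED_STEMMING_EN1,  /* 2 */
    FIXED_STEMMING_EN2   /* 3 */
};

/* Decode a stored number with the enum that lost its deprecated entry. */
static const char *decode_broken(int stored)
{
    switch (stored) {
    case BROKEN_NONE:         return "None";
    case BROKEN_STEMMING_EN1: return "Stemming_en1";
    case BROKEN_STEMMING_EN2: return "Stemming_en2";
    default:                  return NULL;  /* unknown mode */
    }
}

/* Decode a stored number with the enum that keeps the placeholder. */
static const char *decode_fixed(int stored)
{
    switch (stored) {
    case FIXED_NONE:         return "None";
    case FIXED_STEMMING_EN:  return "Stemming_en";
    case FIXED_STEMMING_EN1: return "Stemming_en1";
    case FIXED_STEMMING_EN2: return "Stemming_en2";
    default:                 return NULL;  /* unknown mode */
    }
}

/* NULL-safe string comparison used by the checks below. */
static int same_name(const char *a, const char *b)
{
    return a != NULL && b != NULL && strcmp(a, b) == 0;
}

/* Usage: what each reader makes of numbers written by the old release. */
static void demo_off_by_one(void)
{
    /* An index built with Stemming_en1 is misread as Stemming_en2. */
    assert(same_name(decode_broken(OLD_STEMMING_EN1), "Stemming_en2"));
    /* An index built with Stemming_en2 is not recognised at all. */
    assert(decode_broken(OLD_STEMMING_EN2) == NULL);
    /* With the placeholder restored, old values decode correctly. */
    assert(same_name(decode_fixed(OLD_STEMMING_EN2), "Stemming_en2"));
}
```

Keeping the dead enumerator in place costs nothing at run time and avoids 
renumbering values that are already stored in existing indexes.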

> Brad's original config had:
>     FuzzyIndexingMode Stemming_en2
> which mapped to the "english" stemmer and stored FUZZY_STEMMING_EN2 in
> the index.  Then when searching FUZZY_STEMMING_EN2 was searched in the
> table and found the "porter" stemmer as could be seen in his headers:
>     # Fuzzy Mode: Stemming_en
> Which could cause problems.  What I'm still confused about is why the
> size of the index would have made a difference.

that might be a red herring.

The header reports Stemming_en because of the order of the fuzzy_opts[] array. 
Last night's stemmer.c just reordered those. get_fuzzy_mode() just picks the 
first FUZZY_STEMMING_EN2 it finds. The stemmer.c I'm about to check in further 
reorders fuzzy_opts[] to put the deprecated options last in the list, so they 
don't get listed first and confuse folks.
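
Why the order matters can be shown with a small sketch. This is not the real 
fuzzy_opts[] or get_fuzzy_mode(); the struct, table contents, and function names 
below are hypothetical stand-ins that only assume a linear scan returning the 
first matching entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative mode numbers, not the values from stemmer.h. */
enum demo_mode { DEMO_NONE, DEMO_STEMMING_EN1, DEMO_STEMMING_EN2 };

struct demo_opt {
    int mode;          /* number stored in the index */
    const char *name;  /* name accepted in the config file */
};

/* Preferred names first, deprecated aliases last. */
static const struct demo_opt demo_opts[] = {
    { DEMO_NONE,         "None" },
    { DEMO_STEMMING_EN1, "Stemming_en1" },
    { DEMO_STEMMING_EN2, "Stemming_en2" },
    { DEMO_STEMMING_EN1, "Stemming_en" },  /* deprecated alias */
    { DEMO_STEMMING_EN1, "Stem" },         /* deprecated alias */
};

#define DEMO_OPT_COUNT (sizeof demo_opts / sizeof demo_opts[0])

/* Config name -> mode number. Any alias is accepted. Returns -1 if unknown. */
static int demo_mode_from_name(const char *name)
{
    size_t i;
    for (i = 0; i < DEMO_OPT_COUNT; i++)
        if (strcmp(demo_opts[i].name, name) == 0)
            return demo_opts[i].mode;
    return -1;
}

/* Mode number -> display name. The first matching entry wins, so table
 * order decides which alias shows up in the index header. */
static const char *demo_name_from_mode(int mode)
{
    size_t i;
    for (i = 0; i < DEMO_OPT_COUNT; i++)
        if (demo_opts[i].mode == mode)
            return demo_opts[i].name;
    return NULL;
}

/* Usage: an alias maps to the shared mode, which reports the preferred name. */
static void demo_alias_lookup(void)
{
    int mode = demo_mode_from_name("Stem");
    assert(mode == DEMO_STEMMING_EN1);
    assert(strcmp(demo_name_from_mode(mode), "Stemming_en1") == 0);
}
```

With the deprecated aliases at the end of the table, a reverse lookup never 
reports them, while configs that still use them keep working.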

> Peter, that fuzzy_mode index must match up to only one stemmer, but
> there can be multiple entries for a given fuzzy_mode to allow for
> aliases (Stem, Stemming_en, Stemming_en1, for example).

I think CVS is right now. I checked in a new stemmer.h just a little bit ago 
that should provide backwards compat with 2.4.3 indexes (by putting back the 
enum value whose removal caused the off-by-one), and the stemmer.c I just 
checked in should be a little saner.

I see Brad reports that last night's stemmer.c change did the trick for his 
particular case; I suspect it was the EN1/EN2 mixup that was at fault.

I did discover, while checking 2.4.3, 2.4.4 and CVS, that 2.4.4 did in fact 
break the 2.4.3 index format in some way ("Unknown header 32"). I suspect it's 
related to the RemovedWords/RemovedFiles features of the increm version, in 
db_read.c, but I'm not sure about that. There were a lot of changes in db_read.c 
shortly after 2.4.3 was released, which is a lot of water under the bridge 
before 2.4.4 was released...

What that means is that indexes created with 2.4.3 can be read by 2.4.4, but not 
the other way around. That's actually ok, I think, since I assume that the 
working case is going to be the most common. However, it wasn't clearly 
documented anywhere in the Changes, and we likely should have changed the Magic 
Number to make it explicit. I am not going to change the Magic Number now, since 
that would defeat the fix I put in stemmer.h to make CVS backwards compatible 
with 2.4.3. But it ought to get changed for 2.4.5 (whenever that is...).
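
For context, a magic number check usually looks something like the sketch 
below. The constants, the four-byte big-endian layout, and the function name are 
all assumptions made for illustration; they do not describe the actual index 
header. The point is only that a reader comparing the leading bytes against the 
value it expects can reject a newer format up front instead of failing later on 
an unknown header ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic numbers for two incompatible index formats. */
#define DEMO_MAGIC_V1 0x53570001u
#define DEMO_MAGIC_V2 0x53570002u

enum demo_open_result {
    DEMO_OPEN_OK,
    DEMO_OPEN_TOO_SHORT,
    DEMO_OPEN_WRONG_FORMAT
};

/* Compare the first four bytes of a header (big-endian) to the magic
 * number this reader understands. */
static enum demo_open_result
demo_check_magic(const unsigned char *buf, size_t len, uint32_t expected)
{
    uint32_t found;

    if (len < 4)
        return DEMO_OPEN_TOO_SHORT;

    found = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
            ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    return found == expected ? DEMO_OPEN_OK : DEMO_OPEN_WRONG_FORMAT;
}

/* Usage: an old reader rejects a newer-format header explicitly. */
static void demo_magic_mismatch(void)
{
    const unsigned char v2_header[] = { 0x53, 0x57, 0x00, 0x02 };

    assert(demo_check_magic(v2_header, sizeof v2_header, DEMO_MAGIC_V2)
           == DEMO_OPEN_OK);
    assert(demo_check_magic(v2_header, sizeof v2_header, DEMO_MAGIC_V1)
           == DEMO_OPEN_WRONG_FORMAT);
}
```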


Peter Karman  .  .  peter(at)
Received on Sat Nov 11 13:51:27 2006