Re: fix for my stemmer_en2 issue

From: Brad Miele <bmiele(at)not-real.ipnstock.com>
Date: Sat Nov 11 2006 - 23:37:39 GMT
heh,

scratch my last :) i will download .h as well tonight...

Brad
---------------------
Brad Miele
VP Technology
IPNStock.com
866 476 7862 x902
bmiele@ipnstock.com

On Sat, 11 Nov 2006, Peter Karman wrote:

>
>
> Bill Moseley scribbled on 11/11/06 9:47 AM:
>> On Fri, Nov 10, 2006 at 09:24:12PM -0800, Peter Karman wrote:
>>> The difference when I put them back in however was that instead of being
>>> FUZZY_STEMMING_EN they were changed to FUZZY_STEMMING_EN2. FUZZY_STEMMING_EN was
>>> dropped from stemmer.h at the same time.
>>>
>>> To make matters more confusing, the error message indicates that the deprecated
>>> features Stemming_en and Stem will use Stemmer_en1 -- but they are marked with
>>> FUZZY_STEMMING_EN2 even though they call the same init/free functions as
>>> Stemmer_en1.
>>
>> Oh, that's not good.
>>
>>
>>> So, there's definitely something suspicious in stemmer.c I think. I'm going to
>>> commit a change to CVS -- Brad, would you take a look at the CVS version and see
>>> if that works any better?
>>
>> This will require re-indexing.  That table maps the configuration
>> names to an index number used to indicate the stemmer -- and that
>> number is stored in the index to know what stemmer to use when
>> searching.
>
> yes, I got to thinking about this some more last night after I checked in that
> change to stemmer.c. I think there was also a problem in stemmer.h, since I had
> removed FUZZY_STEMMING_EN altogether, which basically meant there was an
> off-by-1 difference in 2.4.3 vs 2.4.4 indexes wrt the stemmer. It would only
> manifest if you used stemming (which I don't) and searched a 2.4.3 index using
> the 2.4.4 library.
>
>
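The off-by-one Peter describes can be sketched in C. This is only an illustration under stated assumptions: both enums and every member name below are hypothetical, not the real contents of stemmer.h. It shows that removing one enum member shifts the numeric value of every member after it, so a number stored by the older layout decodes to a different mode under the newer one.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical "before" layout; the members and their order are invented. */
enum before_mode { BEFORE_NONE, BEFORE_STEMMING_EN, BEFORE_SOUNDEX, BEFORE_METAPHONE };

/* Hypothetical "after" layout: the STEMMING_EN slot has been removed,
 * so every later member is now numbered one lower than before. */
enum after_mode { AFTER_NONE, AFTER_SOUNDEX, AFTER_METAPHONE };

/* Decode a number stored in an index using the "after" layout. */
static const char *after_name(int stored)
{
    assert(stored >= 0);
    switch (stored) {
    case AFTER_NONE:      return "None";
    case AFTER_SOUNDEX:   return "Soundex";
    case AFTER_METAPHONE: return "Metaphone";
    default:              return "(unknown)";
    }
}
```

With these made-up layouts, after_name(BEFORE_SOUNDEX) yields "Metaphone" rather than "Soundex". Keeping a placeholder member in the removed slot preserves the old numbering, which is the kind of fix described for stemmer.h in this thread.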
>>
>> Brad's original config had:
>>
>>     FuzzyIndexingMode Stemming_en2
>>
>> which mapped to the "english" stemmer and stored FUZZY_STEMMING_EN2 in
>> the index.  Then, when searching, FUZZY_STEMMING_EN2 was looked up in the
>> table and matched the "porter" stemmer, as could be seen in his headers:
>>
>>     # Fuzzy Mode: Stemming_en
>>
>> Which could cause problems.  What I'm still confused about is why the
>> size of the index would have made a difference.
>>
>
> that might be a red herring.
>
> The header reports Stemming_en because of the order of the fuzzy_opts[] array.
> Last night's stemmer.c just reordered those. get_fuzzy_mode() just picks the
> first FUZZY_STEMMING_EN2 it finds. The stemmer.c I'm about to check in further
> reorders fuzzy_opts[] to put the deprecated options last in the list, so they
> don't get listed first and confuse folks.
>
>> Peter, that fuzzy_mode index must match up to only one stemmer, but
>> there can be multiple entries for a given fuzzy_mode to allow for
>> aliases (Stem, Stemming_en, Stemming_en1, for example).
>>
>
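The table behaviour discussed here (several names per mode, reverse lookup returning the first match) can be sketched as follows. This is a minimal sketch, not the actual stemmer.c: the type fuzzy_mode_id, the table opts[], and the helpers mode_to_name() and name_to_mode() are hypothetical names, and the case-sensitive strcmp() comparison is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mode ids; the real enum lives in stemmer.h. */
typedef enum { MODE_NONE, MODE_STEMMING_EN1, MODE_STEMMING_EN2 } fuzzy_mode_id;

typedef struct {
    fuzzy_mode_id mode;   /* number stored in the index */
    const char   *name;   /* configuration name */
} fuzzy_opt;

/* Canonical names come first; deprecated aliases go last so a
 * first-match reverse lookup never reports an alias. */
static const fuzzy_opt opts[] = {
    { MODE_NONE,         "None" },
    { MODE_STEMMING_EN1, "Stemming_en1" },
    { MODE_STEMMING_EN2, "Stemming_en2" },
    { MODE_STEMMING_EN1, "Stemming_en" },  /* deprecated alias */
    { MODE_STEMMING_EN1, "Stem" },         /* deprecated alias */
};

#define N_OPTS (sizeof opts / sizeof opts[0])

/* Return the first name listed for a mode, or NULL if none matches. */
static const char *mode_to_name(fuzzy_mode_id mode)
{
    for (size_t i = 0; i < N_OPTS; i++)
        if (opts[i].mode == mode)
            return opts[i].name;
    return NULL;
}

/* Resolve a configuration name (aliases included); returns 1 on success. */
static int name_to_mode(const char *name, fuzzy_mode_id *out)
{
    for (size_t i = 0; i < N_OPTS; i++)
        if (strcmp(opts[i].name, name) == 0) {
            *out = opts[i].mode;
            return 1;
        }
    return 0;
}
```

In this sketch every alias carries the same mode id as its canonical entry, so each id still maps to exactly one stemmer, and the header line is determined purely by which entry appears first in the table.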
> I think CVS is right now. I checked in a new stemmer.h just a little bit ago
> that should provide backwards compat with 2.4.3 indexes (by putting back in the
> enum value that could cause off-by-one), and the stemmer.c I just checked in
> should be a little saner.
>
> I see Brad reports that last night's stemmer.c change did the trick for his
> particular case; I suspect it was the EN1/EN2 mixup that was at fault.
>
> I did discover, while checking 2.4.3, 2.4.4 and CVS, that 2.4.4 did in fact
> break the 2.4.3 index format in some way. Unknown header 32. I suspect it's
> related to the RemovedWords/RemovedFiles features of the increm version, in
> db_read.c. But not sure on that. There were a lot of changes in db_read.c
> shortly after 2.4.3 was released, which is a lot of water under the bridge
> before 2.4.4 was released...
>
> What that means is that indexes created with 2.4.3 can be read by 2.4.4, but not
> the other way around. That's actually ok, I think, since I assume that the
> working case is going to be the most common. However, it wasn't clearly
> documented in the Changes anywhere, and we likely should have changed the Magic
> Number to make it explicit. I am not going to change the Magic Number now, since
> that defeats the fix I put in stemmer.h to make CVS backwards compatible
> with 2.4.3. But it ought to get changed for 2.4.5 (whenever that is...).
>
> pek
>
> -- 
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>
>
Received on Sat Nov 11 15:37:39 2006