brad miele scribbled on 11/10/06 3:22 PM:
> yes, it seems that the volume of files/words was a factor, since it
> didn't/doesn't crop up with smaller sets.
>
> this test was on the full set, so i am sort of baffled by why that change
> would make the difference.
>
> i guess i should keep looking for a more real solution. the stemmer_en1
> doesn't seem to do as good of a job (at least according to our
> salespeople), and we can't seem to make the jump to 2.4.4 with en2
>
>>> i find that when i remove the two
>>> references at the top to:
>>>
>>> { FUZZY_STEMMING_EN2, "Stemming_en", Stem_snowball,
>>> porter_create_env, porter_close_env, porter_stem },
>>> { FUZZY_STEMMING_EN2, "Stem", Stem_snowball,
>>> porter_create_env, porter_close_env, porter_stem },
>> That's just a mapping table -- it maps the config names ("None",
>> "Stemming_en", etc.) to the code for that stemmer.
>>
>> The difference between 2.4.3 and 2.4.4 is that we removed the old
>> Porter stemmer so Stem and Stemming_en were changed to use the new
>> snowball stemmer code instead of the old Porter code.
>>
I took a look at the diffs from 2.4.3 through 2.4.4. Looks like there were a
couple of changes: one where I took out the Stemming_en and Stem options, and
another where I put them back in with a warning.

The difference when I put them back in, however, was that instead of being
FUZZY_STEMMING_EN they were changed to FUZZY_STEMMING_EN2. FUZZY_STEMMING_EN was
dropped from stemmer.h at the same time.

To make matters more confusing, the error message indicates that the deprecated
features Stemming_en and Stem will use Stemming_en1 -- but they are marked with
FUZZY_STEMMING_EN2, even though they call the same init/free functions as
Stemming_en1.

So I think there's definitely something suspicious in stemmer.c. I'm going to
commit a change to CVS -- Brad, would you take a look at the CVS version and see
if that works any better?
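
For anyone who wants to see the kind of mismatch I mean, here is a rough sketch that flags stemmer functions registered under more than one mode constant. It is a hand-made model, not the real stemmer.c: the first two rows follow the entries quoted above, while the Stemming_en1 row and the FUZZY_STEMMING_EN1 name are assumptions for illustration, and mismatched() is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-made model of the mapping table, NOT copied from stemmer.c.
# Rows 1-2 follow the entries quoted in this thread; row 3 (and the
# FUZZY_STEMMING_EN1 name) is an assumption for illustration only.
my @table = (
    { mode => 'FUZZY_STEMMING_EN2', name => 'Stemming_en',  create => 'porter_create_env' },
    { mode => 'FUZZY_STEMMING_EN2', name => 'Stem',         create => 'porter_create_env' },
    { mode => 'FUZZY_STEMMING_EN1', name => 'Stemming_en1', create => 'porter_create_env' },
);

# Return the create functions that appear under more than one mode.
sub mismatched
{
    my @rows = @_;
    my %modes_for;    # create function -> { mode => 1, ... }
    $modes_for{ $_->{create} }{ $_->{mode} } = 1 for @rows;
    my @bad = grep { scalar(keys %{ $modes_for{$_} }) > 1 } sort keys %modes_for;
    return @bad;
}

print "$_ is used under more than one mode\n" for mismatched(@table);
```

With the model rows above, porter_create_env gets flagged because it shows up under both mode names. Whether that matches the real table is exactly what needs checking in CVS.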

And here's a little script to test all the stemmers. Use it like:

 perl stemtest.pl wordIwant2stem

and it will show how each stemmer handles wordIwant2stem. Note that
SWISH::API 0.04 or later is required for a working Fuzzify() method.
------------------------------8<snip--------------------------
#!/usr/bin/perl
#
# test the Swish-e stemmers
#
#
use strict;
use warnings;
use SWISH::API; # requires 0.04 or later for working Fuzzify()
my $usage = "usage: $0 word2stem\n";
my $html = 'stem_test.html';
my $word = shift @ARGV or die $usage;
# create a throwaway document to index, if it is not already there
unless (-s $html)
{
    open(my $out, '>', $html) or die "can't write $html: $!";
    print $out '<html>some words here that do not matter</html>';
    close($out);
}
my @warm_fuzzies = qw(
Stemming_en
Stem
None
Soundex
Metaphone
DoubleMetaphone
Stemming_es
Stemming_fr
Stemming_it
Stemming_pt
Stemming_de
Stemming_nl
Stemming_en1
Stemming_en2
Stemming_no
Stemming_se
Stemming_dk
Stemming_ru
Stemming_fi
);
for my $f (@warm_fuzzies)
{
    my $index = i($f);
    my $swish = SWISH::API->new($index);
    my $fuzzy = $swish->Fuzzify($index, $word);
    print "$f -> " . join(' ', $fuzzy->WordList) . "\n";
}
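# build (once) a tiny index that uses fuzzy mode $f; returns the index name
sub i
{
    my $f     = shift;
    my $index = "$f.index";
    return $index if -s $index;    # don't create more than once.

    # write a one-line config selecting this fuzzy mode
    open(my $cfg, '>', 'config') or die "can't write config: $!";
    print $cfg "FuzzyIndexingMode $f\n";
    close($cfg);
    system("swish-e -i $html -c config -f $index 1>/dev/null") == 0
      or warn "swish-e failed for $f\n";
    return $index;
}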
------------------------------8<snip--------------------------
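
To see at a glance which stemmers agree with each other (for example, whether Stemming_en lands with Stemming_en1 or Stemming_en2), the output can be grouped by result. A minimal sketch, assuming the "name -> words" line format that the script above prints; group_by_stem() and the groupstems.pl file name are hypothetical and not part of Swish-e or SWISH::API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group stemtest.pl output lines ("name -> stemmed words") by result,
# so stemmers that agree on a word end up in the same bucket.
sub group_by_stem
{
    my %names_for;    # stemmed result -> [ stemmer names ]
    for my $line (@_)
    {
        # lines that do not look like "name -> result" are skipped
        my ($name, $stem) = $line =~ /^(\S+) -> (.*)$/ or next;
        push @{ $names_for{$stem} }, $name;
    }
    return \%names_for;
}

# When given a file of saved output, print one line per distinct result.
if (@ARGV)
{
    my $groups = group_by_stem(<>);
    print "$_: @{ $groups->{$_} }\n" for sort keys %$groups;
}
```

Use it like:

 perl stemtest.pl wordIwant2stem > out.txt
 perl groupstems.pl out.txt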
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Nov 10 21:26:48 2006