I just noticed that the version of SWISH::Stemmer on CPAN differs
from the one in the swish-e distribution: the CPAN version stems
words differently.

SWISH::Stemmer contains the original stemming code extracted from
swish-e. It was created before the swish-e C library offered an API
for stemming.

There are a few problems with SWISH::Stemmer. First, if it gets out
of sync with the stemming code inside swish-e, it might not stem the
same way that swish-e stems while indexing. That's the case now with
the CPAN version. Second, it only does one type of stemming, whereas
swish-e has a number of stemmers available.

The best solution is to use the SWISH::API module for both searching
and stemming (as Jonas Wolf posted with his patch to the highlighting
code the other day), but that won't work if you are using the swish-e
binary for searching.

So, if SWISH::Stemmer needs to stay around, then either it needs to
be updated whenever the swish-e stemmer.c code changes (harder to
track), or it should become a thin wrapper around SWISH::API, with
some way in SWISH::API to provide a Stem() function that doesn't need
a swish handle. (I'm thinking out loud a bit here.)

Here's why I'm posting now: I like the idea of making SWISH::Stemmer
a wrapper around SWISH::API, but I wonder whether loading the large
SWISH::API instead of the small SWISH::Stemmer module would be a
performance issue. Does anyone know if that's an issue on modern
operating systems? That is, is the OS smart enough to load only
what's needed from the shared library?
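
One rough way to put numbers on the load-time question is to time
require for each module. This is only a minimal sketch: time_load is
a helper name made up here, it assumes both modules are installed
(otherwise it just reports that), and it measures wall-clock load
time inside one Perl process, not how many pages the OS actually maps.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: returns the seconds taken to require a module,
# or undef if the name looks invalid or the module fails to load.
sub time_load {
    my ($module) = @_;

    # Only accept plain package names before handing them to string eval.
    return undef unless $module =~ /\A\w+(?:::\w+)*\z/;

    my $start   = [gettimeofday];
    my $ok      = eval "require $module; 1";
    my $elapsed = tv_interval($start);    # elapsed time since $start

    return $ok ? $elapsed : undef;
}

# Default to the two modules discussed here; others can be given
# on the command line.
my @modules = @ARGV ? @ARGV : qw(SWISH::Stemmer SWISH::API);

for my $module (@modules) {
    my $seconds = time_load($module);
    if (defined $seconds) {
        printf "%-16s loaded in %.6f s\n", $module, $seconds;
    }
    else {
        printf "%-16s could not be loaded\n", $module;
    }
}
```

Running it once per module in a fresh process should give a fairer
comparison, since whichever module loads first also pays for any
dependencies the two have in common.
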
--
Bill Moseley
moseley@hank.org
Received on Tue Jul 20 11:26:54 2004