What to do with SWISH::Stemmer

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Jul 20 2004 - 18:26:17 GMT
I just noticed that the version of SWISH::Stemmer on CPAN is
different from the one in the distribution.  Basically, the one on
CPAN stems differently than the one in the distribution.

SWISH::Stemmer contains the original stemming code extracted from
swish-e.  It was created before there was an API for swish to stem
using the swish-e C library.

There are a few problems with SWISH::Stemmer.  First, if it gets out of
sync with the stemming code inside swish-e, it might not stem the
same way that swish-e stems while indexing.  That's the case now with
the CPAN version.  Second, it only does one type of stemming, whereas
swish-e has a number of stemmers available.

The best solution is to use the SWISH::API module for both searching
and stemming (as Jonas Wolf posted with his patch to the highlighting
code the other day), but that won't work if you're using the swish-e
binary for searching.

So, if SWISH::Stemmer needs to stay around, then either it needs to be
updated whenever the swish-e stemmer.c code changes (harder to track),
or it needs to become a thin wrapper around SWISH::API, with some way
in SWISH::API to provide a Stem() function that doesn't need
a swish handle.  (I'm thinking out loud a bit here.)

Here's why I'm posting now:  I like the idea of making SWISH::Stemmer
a wrapper around SWISH::API, but I wonder if there's a performance
cost to loading the large SWISH::API vs. loading the small
SWISH::Stemmer module.  Anyone know if that's an issue on modern
operating systems?  That is, is the OS smart enough to only load
what's needed from the shared library?


-- 
Bill Moseley
moseley@hank.org
Received on Tue Jul 20 11:26:54 2004