Re: RE: stemming

From: SRE <eckert(at)not-real.climber.org>
Date: Fri Nov 19 1999 - 06:02:31 GMT
At 08:41 PM 11/18/99 -0800, David Norris wrote:
>This is probably correct.  The stemmer simply strips the suffix from a
>word.  "Rockies" is "Rocki".  "Supplies" is "Suppli"

I get it. Knowing that, I did a few more tests:
 rockies --> 1 hit  (htm)
 rocki   --> 1 hit  (htm)
 rocky   --> 1 hit  (htm)
 rock    --> 3 hits (htm,txt,html - see filenames below)

This might not be what I expected, but at least it mostly makes sense.
The confusing part is that 'rocky' gets indexed as 'rocki'!

  % swish-e -f eckert.index -D | grep rock
     blackrock: 8 39 1 1
     rock: 1 10 41 1 10 2 1 1 12 21 9 1
     rockbound: 10 3 1 1 12 32 9 1
     rockhous: 10 3 1 1 12 32 9 1
     rocki: 1 28 9 1
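
The 'rocky' --> 'rocki' behavior would be consistent with a Porter-style
stemmer, where a trailing "ies" becomes "i" and a trailing "y" becomes "i"
whenever the rest of the word contains a vowel. Below is a minimal sketch of
just those two steps, assuming the stemmer follows those rules (an assumption
on my part, not something confirmed in this thread). The names toy_stem and
has_vowel are made up for illustration and are not part of swish-e.

```python
VOWELS = set("aeiou")


def has_vowel(stem):
    """Return True if stem contains a vowel ('y' counts after a consonant)."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def toy_stem(word):
    """Apply only two Porter-style steps (plural stripping, y -> i).

    This is a simplified illustration, not the full algorithm and not
    swish-e's actual code.
    """
    w = word.lower()

    # Plural handling: sses -> ss, ies -> i, ss unchanged, s dropped.
    if w.endswith("sses"):
        w = w[:-2]
    elif w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("ss"):
        pass
    elif w.endswith("s"):
        w = w[:-1]

    # Terminal y -> i when the remaining stem contains a vowel.
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    return w


for word in ("rockies", "rocky", "rocki", "rock", "supplies"):
    print(word, "-->", toy_stem(word))
```

Under these rules 'rockies', 'rocky' and 'rocki' all reduce to 'rocki', while
'rock' stays 'rock', which would explain why the first three searches return
the same single hit and 'rock' returns a different set.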

I verified with grep that "rocki" does not occur in the directory, and
that "rocky" does (in the .htm file that accounts for the "1 hit" above):
  % grep -i rock *
     bearboxes.htm:near a large rock and visible hitch racks.
     bearboxes.htm:and near some rock bluffs approximately 250 yards west of the lake.
     bearboxes.htm:One box on the rocky peninsula on the south side of Lower Soldier Lake.
     rangercontacts.txt:ref: Blackrock Station, Sherman Pass, Kern Peak, Domelands (1999)
     sierrapeakslist.html:  *  class  5:  Technical rock climbing
     sierrapeakslist.html:Cartago Peak is the highest rock pile out of several nearby;

I don't suppose there is any way to know which variant of a word
got matched, right? I'd like to display that in my results page
if it's not an exact match, but I'm pretty sure that's not possible.
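
One possible workaround, sketched below, is to do this outside the search
engine: after a hit comes back, stem each word of the matching text with the
same rules the indexer used and report the words whose stem equals the stem
of the query. Everything here is hypothetical: toy_stem is a two-rule
stand-in for the real stemmer, and matched_variants is a helper name invented
for this sketch, not a swish-e feature.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for the indexer's stemmer (two rules only)."""
    w = word.lower()
    if w.endswith("ies"):
        return w[:-2]          # rockies -> rocki
    if w.endswith("y") and len(w) > 2:
        return w[:-1] + "i"    # rocky -> rocki
    return w


def matched_variants(query, text, stem=toy_stem):
    """Return the distinct words in text whose stem equals the query's stem."""
    target = stem(query)
    variants = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if stem(word) == target and word not in variants:
            variants.append(word)
    return variants


line = "One box on the rocky peninsula on the south side of Lower Soldier Lake."
print(matched_variants("rockies", line))  # prints ['rocky']
```

A results page could then show the variant whenever it differs from the word
the user typed. This only works if the stand-in stemmer agrees with the one
used at index time, so the real stemming routine would have to be substituted
for toy_stem.
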
Received on Thu Nov 18 21:58:22 1999