Re: Fuzzy Indexing Questions

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Thu May 08 2003 - 06:26:31 GMT

On Wed, 2003-05-07 at 19:19, John Movius wrote:
> I have a 100 Meg genealogy website (currently using an older version of
> SWISH) and I am interested in using the "fuzzy" indexing mode of SWISH-e
> on it in the near future.  I understand that the SWISH-e fuzzy indexing
> feature provides a search similar to the "Soundex" search 

SWISH-E implements the Soundex algorithm (as described by Don Knuth) as
well as the Metaphone, DoubleMetaphone, and Stemming algorithms.

> However a fuzzy index is most likely to also be of
> substantially larger size than a normal SWISH-e index.

In the case of Soundex the index should be much smaller.  Soundex, as
you probably know, reduces each word to a four-character code (a letter
followed by three digits).  Since each digit stands for several letters,
many different words reduce to the same Soundex code.
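
To make the reduction concrete, here is a minimal sketch of the standard
American Soundex rules in Python.  It is an illustration only, not
SWISH-E's own code, and SWISH-E's implementation may differ in detail.
The names CODES and soundex are made up for this example.

```python
# Illustrative sketch of standard American Soundex.
# Assumption: this is NOT SWISH-E's source; its variant may differ.

# Each digit stands for a group of similar-sounding consonants.
CODES = {}
for digit, letters in {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return a four-character code: first letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate consonants that share a code.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels have no code, so they reset prev and allow a repeat.
        prev = code
    # Pad with zeros or truncate to exactly four characters.
    return (result + "000")[:4]


for name in ("Smith", "Smyth", "Robert", "Rupert"):
    print(name, soundex(name))
```

Smith and Smyth both come out as S530, and Robert and Rupert both come
out as R163, so each pair collapses to a single index term.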

> Thus I am wondering if two SWISH-e indexes are needed to accomplish my
> goals ...

Yes.  You might want to provide an index with soundex and an index
without soundex.

> My questions include: Has anyone on this SWISH-e list had any actual
> experience in using the fuzzy indexing feature of SWISH-e?

I use it a bit.  I incorporated the Soundex algorithm into SWISH-E.  So,
I'm probably the one responsible if it doesn't work as you expect.  ;-)

> Is it possible to have two SWISH-e search engines installed and
> operating on the same web server? 

You would simply need multiple index files...
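
As a rough sketch of how one search script could serve both indexes,
the Python below picks an index file and builds a swish-e command line.
The function name build_search_command and its fuzzy flag are
hypothetical, and it assumes the two index files sit in the working
directory under the names shown in the listing further down.  The -f
(index file), -w (search words), and -m (maximum results) options are
standard swish-e switches.

```python
# Hypothetical helper: choose an index and build the swish-e arguments.
# Assumption: genes.idx (Soundex) and genes-no.idx (normal) exist in the
# current directory; adjust the paths for a real installation.

def build_search_command(words, fuzzy=False, max_hits=10):
    """Return the argument list for one swish-e search."""
    index_file = "genes.idx" if fuzzy else "genes-no.idx"
    return ["swish-e", "-f", index_file, "-w", words, "-m", str(max_hits)]


# Exact-match search and sound-matching search for the same surname.
print(build_search_command("davis"))
print(build_search_command("davis", fuzzy=True))
```

Passing the result to subprocess.run as a list keeps the search words
away from the shell, which matters when they come from a web form.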

> Has this been done with success ... i.e. are there any examples using it
> to look at on the WWW (only fuzzy? Fuzzy plus normal)?

Without Soundex: 
http://webaugur.com/search/?do=now&words=davis&strSearchSiteKey=Genealogy&intMaxHits=10

With Soundex:
http://webaugur.com/search/?do=now&words=davis&strSearchSiteKey=Genealogy_%28Sound_Matching%29&intMaxHits=10

"David" appears at the bottom of all pages.  Davis and David are the
same Soundex code.  So, with Soundex, all pages are returned when
searching for "Davis" whereas only the D and index pages are returned
when searching without Soundex.  

Ideally, one might utilize some sort of metadata to restrict the soundex
index to only the genealogical data itself.

> Does anyone have any stats on the relative size of a regular SWISH-e
> index vs. a fuzzy SWISH-e index?  I realize this could vary
> considerably.   

$ ls -l genes*
-rw-r--r--  1 augur  users  51136 May  8 02:11 genes-no.idx
-rw-r--r--  1 augur  users  48277 May  8 02:08 genes.idx

genes-no.idx is without Soundex.  genes.idx is with Soundex.  This is an
extremely small dataset (since I have neglected my genealogy for several
years ;-)
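
For anyone who wants to put a number on the difference, a small sketch
of the arithmetic using the two sizes from the listing above.  The
function name percent_saved is made up for this example.

```python
# Relative size difference between the two index files listed above.

def percent_saved(normal_bytes, soundex_bytes):
    """Percentage by which the Soundex index is smaller than the normal one."""
    return 100.0 * (normal_bytes - soundex_bytes) / normal_bytes


normal_bytes = 51136   # genes-no.idx, without Soundex
soundex_bytes = 48277  # genes.idx, with Soundex

print(normal_bytes - soundex_bytes, "bytes saved")
print(round(percent_saved(normal_bytes, soundex_bytes), 1), "percent smaller")
```

On this tiny dataset the Soundex index is 2859 bytes, or roughly 5.6
percent, smaller than the normal index.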

In short, as your dataset grows, the Soundex index should become
increasingly small relative to the normal index.

-- 
 David Norris
  http://www.webaugur.com/dave/
  ICQ - 412039
Received on Thu May 8 06:30:30 2003