On Wed, 2003-05-07 at 19:19, John Movius wrote:
> I have a 100 Meg genealogy website (currently using an older version of
> SWISH) and I am interested in using the "fuzzy" indexing mode of SWISH-e
> on it in the near future. I understand that the SWISH-e fuzzy indexing
> feature provides a search similar to the "Soundex" search
SWISH-E implements the Soundex algorithm (as described by Don Knuth) as
well as the Metaphone, DoubleMetaphone, and Stemming algorithms.
> However a fuzzy index is most likely to also be of
> substantially larger size than a normal SWISH-e index.
With Soundex the index should be smaller, not larger. Soundex, as you
probably know, reduces each word to a four-character code: the first
letter followed by three digits. Since each digit stands for several
letters, many different words reduce to a single Soundex code.
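
To illustrate how different words collapse into one code, here is a
minimal sketch of the textbook American Soundex rules in Python. It is
not SWISH-E's own code, and SWISH-E's implementation may differ in
detail; the function name and the sample names are only for
illustration.

```python
# Letter-to-digit table for textbook American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the first letter plus three digits, e.g. 'R163'."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = CODES.get(letter, "")
        # Emit a digit only when it differs from the preceding code.
        if digit and digit != previous:
            code += digit
        # Vowels reset the preceding code; H and W leave it unchanged.
        if letter not in "HW":
            previous = digit
    # Pad with zeros and cut to four characters.
    return (code + "000")[:4]


# Illustrative sample: six names, far fewer distinct codes.
names = ["Robert", "Rupert", "Rubin", "Smith", "Smyth", "Schmidt"]
codes = {name: soundex(name) for name in names}
print(codes)
print(len(set(names)), "names ->", len(set(codes.values())), "codes")
```

Fewer distinct codes than distinct words means fewer entries for the
index to store, which is why a Soundex index tends to shrink.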
> Thus I am wondering if two SWISH-e indexes are needed to accomplish my
> goals ...
Yes. You might want to provide an index with soundex and an index
without soundex.
> My questions include: Has anyone on this SWISH-e list had any actual
> experience in using the fuzzy indexing feature of SWISH-e?
I use it a bit. I incorporated the Soundex algorithm into SWISH-E. So,
I'm probably the one responsible if it doesn't work as you expect. ;-)
> Is it possible to have two SWISH-e search engines installed and
> operating on the same web server?
You would simply need multiple index files...
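
As a rough sketch of what "multiple index files" could look like from a
search script, the Python below picks one of two index files and builds
a swish-e command line. The helper name is hypothetical, the index file
names are the ones shown later in this message, and the flags (-f for
the index file, -w for the search words, -m for the maximum number of
results) are my understanding of the swish-e command line; check them
against your installed version.

```python
def build_search_command(words, use_soundex, max_hits=10):
    """Build (but do not run) a swish-e command for one of two indexes."""
    # Assumption: genes.idx was indexed with Soundex, genes-no.idx without.
    index_file = "genes.idx" if use_soundex else "genes-no.idx"
    return ["swish-e", "-f", index_file, "-w", words, "-m", str(max_hits)]


# The same query, routed to either index.
print(build_search_command("davis", use_soundex=False))
print(build_search_command("davis", use_soundex=True))
```

The returned list can be handed to subprocess.run() on a machine where
swish-e is installed; a single binary serves both indexes.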
> Has this been done with success ... i.e. are there any examples using it
> to look at on the WWW (only fuzzy? Fuzzy plus normal)?
Without Soundex:
http://webaugur.com/search/?do=now&words=davis&strSearchSiteKey=Genealogy&intMaxHits=10
With Soundex:
http://webaugur.com/search/?do=now&words=davis&strSearchSiteKey=Genealogy_%28Sound_Matching%29&intMaxHits=10
"David" appears at the bottom of all pages. Davis and David are the
same Soundex code. So, with Soundex, all pages are returned when
searching for "Davis" whereas only the D and index pages are returned
when searching without Soundex.
Ideally, one might utilize some sort of metadata to restrict the soundex
index to only the genealogical data itself.
> Does anyone have any stats on the relative size of a regular SWISH-e
> index vs. a fuzzy SWISH-e index? I realize this could vary
> considerably.
$ ls -l genes*
-rw-r--r-- 1 augur users 51136 May 8 02:11 genes-no.idx
-rw-r--r-- 1 augur users 48277 May 8 02:08 genes.idx
genes-no.idx is without Soundex. genes.idx is with Soundex. This is an
extremely small dataset (since I have neglected my genealogy for several
years ;-)
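
For a quick comparison, the relative saving can be worked out from the
two sizes in the listing above. This is only arithmetic on those two
numbers, sketched in Python; the function name is made up for the
example.

```python
def relative_saving(plain_size, fuzzy_size):
    """Fraction by which the fuzzy index is smaller than the plain one."""
    return (plain_size - fuzzy_size) / plain_size


plain = 51136    # genes-no.idx, without Soundex
fuzzy = 48277    # genes.idx, with Soundex
saving = relative_saving(plain, fuzzy)
print(f"Soundex index is {saving:.1%} smaller")
```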
In short, as your dataset grows, the Soundex index should become
increasingly smaller relative to the normal index.
--
David Norris
http://www.webaugur.com/dave/
ICQ - 412039
Received on Thu May 8 06:30:30 2003