Re: OCR/Double Metaphone phrase issue

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Nov 18 2002 - 15:45:10 GMT
At 06:56 AM 11/18/02 -0800, Erik Corry wrote:
>I'm evaluating Swish-E for use with data that has been scanned and
>OCRred.  Looks great.  I have some ideas for how to do a fuzzy
>search that catches OCR errors, but it involves generating
>several indexing words for each word in the input.  This is
>also something the Double-Metaphone method does now - sometimes
>there are two words that are output from the Double-Metaphone
>encoding.   They are placed at the same word index.
>
>Unfortunately you can't use phrase searches if Metaphone does
>this.  That's a bit of a downer.  My guess is that this is
>simply because of the parsing and handling of the query.

Ah, one of the documented bugs!

It's a limitation of the current parser, which needs a rewrite.  When
searching, it expands a word into its two metaphones (wordA OR wordB), and
you can't use an expression like that inside of a phrase.  For example, if
the word "and" is NOT a stop word then searching for

   -w foo and bar

has to find docs containing both the word "foo" and the word "bar", whereas

   -w "foo and bar"

has to find all three words as a phrase.
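
To make the difference concrete, here is a toy positional index in Python.
This is not Swish-e's code: the index layout and the names build_index,
search_and and search_phrase are made up for illustration.  An AND query
only needs both words somewhere in the document, while a phrase query needs
every word at consecutive word positions.

```python
from collections import defaultdict

def build_index(docs):
    """Map word -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def search_and(index, words):
    """Docs that contain every word, at any position."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_phrase(index, words):
    """Docs that contain the words at consecutive positions."""
    hits = set()
    for doc_id in search_and(index, words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical three-document collection.
docs = {1: "foo and bar", 2: "bar baz foo", 3: "foo bar"}
index = build_index(docs)

and_hits = search_and(index, ["foo", "bar"])              # like: -w foo and bar
phrase_hits = search_phrase(index, ["foo", "and", "bar"]) # like: -w "foo and bar"
```

The phrase check compares one word per position, which is why a word that
has been expanded into an OR expression does not fit into it as is.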

The plan is to rewrite the parser.  But it's one of those things that we
have not got around to yet.  Anyone is welcome to jump right in and fix it!

>If we invented a new symbol for 'phrase' eg. '#' then the (web)
>frontend could transform the user's query from say:
>
>"Fred Pollack"
>
>into
>
>fred # pollack
>
>and then into
>
>(fred | fre | frd | red) # (pollack | pollac | pollak | pollck | polack |
>pllack | ollack)

Or the parser needs to generate a recursive data structure with flags on
each word to indicate things like being in a phrase.
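
Something along these lines, as a very rough Python sketch.  None of this
exists in Swish-e: Word, Node, phrase and matches are hypothetical names,
and the codes "a1", "a2" and "b1" are placeholders rather than real Double
Metaphone output.  Each word carries its list of alternatives plus an
in_phrase flag, and a phrase matches when some alternative of each word
sits at consecutive positions.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Word:
    alternatives: List[str]   # e.g. both codes produced for one input word
    in_phrase: bool = False   # flag set by the parser

@dataclass
class Node:
    op: str                   # "AND", "OR" or "PHRASE"
    children: List[Union["Node", Word]] = field(default_factory=list)

def phrase(*words):
    """Build a PHRASE node and flag each word as being inside a phrase."""
    for w in words:
        w.in_phrase = True
    return Node("PHRASE", list(words))

def positions(doc_words, word):
    """Positions in the document where any alternative of the word occurs."""
    return {i for i, w in enumerate(doc_words) if w in word.alternatives}

def matches(node, doc_words):
    """Recursively evaluate the query tree against one document's word list."""
    if isinstance(node, Word):
        return bool(positions(doc_words, node))
    if node.op == "AND":
        return all(matches(c, doc_words) for c in node.children)
    if node.op == "OR":
        return any(matches(c, doc_words) for c in node.children)
    if node.op == "PHRASE":
        if not node.children:
            return False
        # In this sketch a phrase holds only Word leaves.
        first, rest = node.children[0], node.children[1:]
        return any(
            all(start + i in positions(doc_words, w)
                for i, w in enumerate(rest, 1))
            for start in positions(doc_words, first)
        )
    raise ValueError("unknown operator: " + node.op)

# A two-word phrase where the first word has two alternative codes.
query = phrase(Word(["a1", "a2"]), Word(["b1"]))
```

With that, matches(query, ["x", "a2", "b1"]) is true because "a2" and "b1"
are adjacent, while the same two words wrapped in an AND node would also
match a document where they are apart.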



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Nov 18 15:45:22 2002