Re: New version swish-e-1.3.2-PHRASEi

From: Ron Samuel Klatchko <rsk(at)not-real.corpmail.brightmail.com>
Date: Mon May 08 2000 - 17:46:41 GMT
Jose Manuel Ruiz wrote:
> This is how swish-e works:
> All the words are extracted from the files and, when finished, automatic
> stopwords are removed in the removestops function (index.c).
> Well, this is OK for old swish-e (1.3.2). But if you are using the PHRASE
> version it is necessary to recalculate the positions of all the
> words to decrease the counter when an automatic stopword
> precedes any valid word.
> 
> For example, a document containing:
> 
> this is a phrase in a document
> 
> gets the following word positions:
> 
> this: 1
> is: 2
> a: 3 6
> phrase: 4
> in: 5
> document: 7
> 
> After processing automatic stopwords, the word "a" is removed
> and the positions remain as follows:
> 
> this: 1
> is: 2
> phrase: 4
> in: 5
> document: 7
> 
> But they should be:
> 
> this: 1
> is: 2
> phrase: 3
> in: 4
> document: 5
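
The renumbering described in the quoted text could be sketched roughly as
follows. This is only an illustration in Python, not the swish-e code in
index.c; the name compact_positions and the dictionary layout are
hypothetical.

```python
from bisect import bisect_left

def compact_positions(positions, stopwords):
    """Drop stopwords and shift the remaining positions down to close the gaps.

    `positions` maps word -> list of 1-based positions (hypothetical layout,
    not the real swish-e index structure).
    """
    # Every position that was occupied by a removed stopword, in order.
    removed = sorted(p for w, ps in positions.items() if w in stopwords for p in ps)
    compacted = {}
    for word, ps in positions.items():
        if word in stopwords:
            continue
        # A position drops by the number of removed stopwords that precede it.
        compacted[word] = [p - bisect_left(removed, p) for p in ps]
    return compacted

index = {"this": [1], "is": [2], "a": [3, 6], "phrase": [4], "in": [5], "document": [7]}
result = compact_positions(index, {"a"})
```

With the example document, result gives phrase: 3, in: 4 and document: 5,
which matches the "should be" listing in the quoted text.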

Are you sure about this?  That means if a user searches for the phrase
"in document" you'll turn up this entry even though the actual phrase is
"in a document."

Is it possible to detect all stop words at search time?  You could then
code up the search for the phrase "in a document" to find the word "in" at
position X, no word at position X+1, and the word "document" at X+2.  I
admit it still wouldn't be perfect since you could not differentiate
between "in a document" and "in the document" but it seems to match
expectations (at least my expectations) better.
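
A rough sketch of that search, in Python for illustration only. It assumes
the index keeps the original positions (with holes where stopwords were
removed) and that the stopword set is available at search time; find_phrase
and the dictionary layout are hypothetical, not swish-e functions.

```python
def find_phrase(positions, stopwords, phrase):
    """Return start positions where `phrase` matches, treating stopwords as gaps."""
    words = phrase.lower().split()
    # Indexed words must match at their offset; stopwords only consume an offset.
    anchors = [(i, w) for i, w in enumerate(words) if w not in stopwords]
    if not anchors:
        return []
    # Every position that still holds an indexed word.
    occupied = {p for ps in positions.values() for p in ps}
    first_off, first_word = anchors[0]
    hits = []
    for p in positions.get(first_word, []):
        start = p - first_off
        words_match = all(start + off in positions.get(w, []) for off, w in anchors)
        # The slot of each stopword in the query must be empty in the index.
        gaps_empty = all(start + i not in occupied
                         for i, w in enumerate(words) if w in stopwords)
        if words_match and gaps_empty:
            hits.append(start)
    return hits

# Original positions kept after "a" was removed as an automatic stopword.
index = {"this": [1], "is": [2], "phrase": [4], "in": [5], "document": [7]}
stop = {"a", "the"}
with_gap = find_phrase(index, stop, "in a document")
without_gap = find_phrase(index, stop, "in document")
other_stopword = find_phrase(index, stop, "in the document")
```

Here "in a document" matches at position 5 and "in document" does not match.
"in the document" also matches at position 5, which is the limitation
mentioned above.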

moo
------------------------------------------------------------
        Ron Samuel Klatchko - Senior Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Mon May 8 13:49:08 2000