Jose Manuel Ruiz wrote:
> This is how swish-e works:
> All the words are extracted from the files and, when finished, automatic
> stopwords are removed in the removestops function (index.c).
> Well, this is OK for old swish-e (1.3.2). But if you are using the PHRASE
> version, it is necessary to recalculate the positions of all the
> words, decreasing the counter whenever an automatic stopword
> precedes a valid word.
>
> For example, a document containing:
>
> this is a phrase in a document
>
> gets the following word positions:
>
> this: 1
> is: 2
> a: 3 6
> phrase: 4
> in: 5
> document: 7
>
> After processing automatic stopwords, the word "a" is removed
> and the positions remain as follows:
>
> this: 1
> is: 2
> phrase: 4
> in: 5
> document: 7
>
> But they should be:
>
> this: 1
> is: 2
> phrase: 3
> in: 4
> document: 5
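
To make the proposed renumbering concrete, here is a minimal Python sketch. It is not swish-e's actual C code from index.c; the function name `renumber_positions` and the dict-of-lists layout are made up for illustration. Each surviving position is shifted down by the number of removed stopword positions that come before it.

```python
from bisect import bisect_left


def renumber_positions(index, stopwords):
    """Drop stopwords and close the gaps they leave behind.

    index maps word -> list of 1-based positions in one document
    (a hypothetical layout, not swish-e's real index structure).
    """
    # Every position that was occupied by a stopword, in ascending order.
    removed = sorted(p for w in stopwords for p in index.get(w, []))
    result = {}
    for word, positions in index.items():
        if word in stopwords:
            continue
        # bisect_left counts how many removed positions lie before p.
        result[word] = [p - bisect_left(removed, p) for p in positions]
    return result


# Positions for "this is a phrase in a document", as in the example above.
doc = {"this": [1], "is": [2], "a": [3, 6],
       "phrase": [4], "in": [5], "document": [7]}
renumbered = renumber_positions(doc, {"a"})
print(renumbered)
```

With the example document this gives this: 1, is: 2, phrase: 3, in: 4, document: 5, which is the "should be" list quoted above.
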
Are you sure about this? That means if a user searches for the phrase
"in document" you'll turn up this entry even though the actual phrase is
"in a document."
Is it possible to detect all stop words at search time? You could then
code up search for the phrase "in a document" to find the word "in" at
position X, no word at position X+1, and the word "document" at X+2. I
admit it still wouldn't be perfect since you could not differentiate
between "in a document" and "in the document" but it seems to match
expectations (at least my expectations) better.
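
Roughly what I have in mind, as a Python sketch rather than real swish-e code. It assumes the index keeps the original positions (4, 5, 7 in the example) and that the search code can get at the list of automatic stopwords; the name `phrase_matches` and the data layout are invented for illustration.

```python
def phrase_matches(index, query, stopwords):
    """Return the start positions where the query phrase matches.

    index maps word -> list of original 1-based positions, with the
    stopwords already removed but the gaps left in place (hypothetical
    layout). A stopword in the query matches any empty position.
    """
    # Positions that still hold an indexed (non-stop) word.
    occupied = {p for positions in index.values() for p in positions}
    # Anchor the search on the first query word that is actually indexed.
    anchor = next((i for i, w in enumerate(query) if w not in stopwords), None)
    if anchor is None:
        return []  # query consists only of stopwords; nothing to look up
    starts = []
    for p in index.get(query[anchor], []):
        start = p - anchor
        if start < 1:
            continue  # phrase would begin before the first word
        ok = True
        for i, w in enumerate(query):
            pos = start + i
            if w in stopwords:
                ok = pos not in occupied        # "no word at position X+1"
            else:
                ok = pos in index.get(w, [])    # exact word at this offset
            if not ok:
                break
        if ok:
            starts.append(start)
    return starts


# "this is a phrase in a document" with "a" removed, positions untouched.
doc = {"this": [1], "is": [2], "phrase": [4], "in": [5], "document": [7]}
stops = {"a", "the"}
print(phrase_matches(doc, ["in", "a", "document"], stops))    # [5]
print(phrase_matches(doc, ["in", "document"], stops))         # []
print(phrase_matches(doc, ["in", "the", "document"], stops))  # [5]
```

The last call shows the limitation I mentioned: "in the document" matches too. The sketch also does not know the document length, so a stopword at the very end of a query would match past the last word.
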
moo
------------------------------------------------------------
Ron Samuel Klatchko - Senior Software Jester
Brightmail Inc - rsk@brightmail.com
Received on Mon May 8 13:49:08 2000