Re: PHRASE search: More about stopwords

From: Jose Manuel Ruiz <jmruiz(at)>
Date: Fri May 12 2000 - 08:51:31 GMT
Bill Moseley wrote:
> At 10:32 AM 05/11/00 -0700, Jose Manuel Ruiz wrote:
> >Some things about phrase search and stopwords that
> >will be implemented in next beta.
> >
> >Since there are several opinions about stopwords
> >and word position I think this may be a solution
> >for all: A config file option to enable or disable
> >position increasing when stopwords are in a phrase.
> >In the second case, indexing will be slower because positions
> >must be recalculated when automatic stopwords are found.
> >
> >Words shorter than minwordlength are also added to stopwords
> >list. In fact, they are like stopwords.
> Am I correct that if you specify stopwords in the config file such that no
> additional (automatic) stopwords are found during indexing that swish will
> not need to reposition?  In other words, will indexing speed be the same if
> no automatic stopwords are found?

You can ensure that no automatic stopwords are searched for with
IgnoreLimit 100 xxxx   (xxxx is whatever you want). This will be the
default value. The value is checked in the removestops function and,
if it is 100, no action is performed.
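
To picture that check, here is a minimal sketch in Python (not the actual C code of removestops). The function name, its arguments, and the "more than N percent of files and more than M files" reading of the two IgnoreLimit values are assumptions made for illustration only.

```python
def automatic_stopwords(doc_freq, total_files, percent, min_files):
    """Return the words treated as automatic stopwords (hypothetical helper).

    doc_freq maps each word to the number of files it appears in.
    A word qualifies when it appears in more than `percent` percent of
    the files and in more than `min_files` files (assumed semantics).
    """
    # With a limit of 100 nothing is done at all, so no automatic
    # stopwords are found and no repositioning is needed later.
    if percent >= 100:
        return set()
    return {
        word
        for word, files in doc_freq.items()
        if files > min_files and files * 100 > percent * total_files
    }
```

With `percent` set to 100 the second value is never looked at, which is why it can be anything.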

Anyway, if you like, you can have both IgnoreWords and IgnoreLimit.
But if you know your stopwords, it is better for indexing performance
to put them in IgnoreWords. Automatic stopwords (IgnoreLimit words) are
computed at the end of the indexing process, and stopwords (IgnoreWords)
are computed at the beginning. So automatic stopwords need to be removed
from the total of valid words at the end of the index process, and the
positions of all the remaining valid words in all files must then be
recalculated (this is the CPU
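
The renumbering described above can be sketched roughly as follows. This is only an illustration in Python under my own assumptions: the function name, the flag name, and the 1-based positions are hypothetical and do not reflect the real index structures.

```python
def reposition(words, stopwords, count_stopwords):
    """Return (word, position) pairs for the non-stopwords of one file.

    words           -- the file's words in order of appearance
    stopwords       -- set of words to drop from the index
    count_stopwords -- True: a stopword still advances the position;
                       False: positions are compacted as if the
                       stopword had never been there
    """
    kept = []
    position = 0
    for word in words:
        is_stop = word in stopwords
        if is_stop and not count_stopwords:
            continue  # stopword takes no position at all
        position += 1
        if not is_stop:
            kept.append((word, position))
    return kept


words = ["the", "quick", "fox", "of", "doom"]
stops = {"the", "of"}
print(reposition(words, stops, True))   # [('quick', 2), ('fox', 3), ('doom', 5)]
print(reposition(words, stops, False))  # [('quick', 1), ('fox', 2), ('doom', 3)]
```

When the stopwords are known from the start, the positions can be assigned correctly on the first pass. When they are only found at the end, a pass like this has to be repeated over every file.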

A good approach: run the index process without stopwords and with the
IgnoreLimit parameter you like. Do a DUMP (option -D) of the index file
to see the automatic stopwords, and then modify IgnoreWords with those
values.
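
The last step can be helped along with a small script. The sketch below assumes you have already copied the automatic stopwords out of the dump by hand (it does not parse the dump), and that IgnoreWords accepts a space-separated word list on one line. The function name is hypothetical.

```python
def ignore_words_line(words):
    """Build an IgnoreWords config line from a list of stopwords.

    Words are lowercased, de-duplicated and sorted so that the line
    stays stable between runs.
    """
    unique = sorted({word.lower() for word in words})
    return "IgnoreWords " + " ".join(unique)


print(ignore_words_line(["The", "of", "the", "and"]))  # IgnoreWords and of the
```

Paste the resulting line into the config file and index again; with IgnoreLimit at 100 no repositioning should then be needed.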

Remember that this is not yet available in the latest beta.
I think that all this stuff will be available next week.

> Bill Moseley


Jose Manuel Ruiz Ramos
Received on Fri May 12 04:56:33 2000