Re: IgnoreLimit (was Re: Q: Segmentation Fault?)

From: Bill Moseley <moseley(at)>
Date: Mon Feb 19 2001 - 16:00:44 GMT
At 06:15 AM 02/19/01 -0800, Gastlogin Internet-Cafe Muenchen wrote:
>Cool! I actually found a real bug (although it has been fixed). I
>suppose IgnoreWords requires defining stop words manually which can be a
>problem if one is indexing a language one doesn't fully understand, and
>so auto-stopwords might be better in such situations. 

I think the idea is to use IgnoreLimit the first time you index.  Swish is
supposed to list the words it removed, and then you add those words to your
IgnoreWords setting.
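
To make that workflow concrete, here is a minimal sketch in Python.  It is
not Swish's code.  It assumes that "IgnoreLimit 70 500" means "drop words
that occur in more than 70% of the files and in more than 500 files", and
the helper name auto_stopwords and the tiny sample data are made up for
illustration.  It prints a line you could paste into a config file.

```python
from collections import Counter

def auto_stopwords(docs, percent, min_files):
    """Return words found in more than `percent`% of the documents
    and in more than `min_files` documents (hypothetical helper)."""
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    total = len(docs)
    return sorted(
        word for word, n in doc_freq.items()
        if n > min_files and n * 100 > percent * total
    )

# Toy corpus: each inner list is the word stream of one file.
docs = [
    ["the", "rifle", "of", "http"],
    ["the", "of", "index"],
    ["the", "swish", "of"],
    ["the", "words"],
]

# Small thresholds so the toy corpus triggers them.
stop = auto_stopwords(docs, percent=70, min_files=2)
print("IgnoreWords " + " ".join(stop))
```

Once the list is in IgnoreWords, later indexing runs can skip IgnoreLimit
and avoid the slow removal step.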

The time difference is big: without IgnoreLimit it took about 45 seconds
to index.  With "IgnoreLimit 70 500" it's been running for about fifteen
minutes so far and is still only on the third word....

The printing of the removed words wasn't working, though.  I modified the
code just now, but I'm still not sure how it should print.  With verbose
set (I use IndexReport 1), it generates a lot of output (it displays
"Computing new positions for..." for every word in the index, it seems).
That's probably more output than is needed.

Removing word #0 '4' (5678 occurrences)
Computing new positions for rifl (5 occurrences)

Removing word #1 'http' (14714 occurrences)
Computing new positions for to (9938 occurrences)

Removing word #2 'of' (14319 occurrences)

And after about 15 minutes I killed it.  Jose, would it be possible to make
all the adjustments in one pass or must they be made one word at a time?
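
For what a one-pass adjustment might look like, here is a rough sketch in
Python.  The index layout (word -> {file_id: [positions]}) and the function
name are assumptions for illustration, not Swish's real structures.  The
idea is to collect every removed position per file first, then shift each
surviving position down by the number of removed positions before it.

```python
from bisect import bisect_left
from collections import defaultdict

def remove_stopwords_one_pass(index, stopwords):
    """index maps word -> {file_id: [positions]} (hypothetical layout).
    Returns a new index without the stopwords, with positions closed up."""
    # Step 1: gather every position that is going away, per file.
    removed = defaultdict(list)
    for word in stopwords:
        for file_id, positions in index.get(word, {}).items():
            removed[file_id].extend(positions)
    for positions in removed.values():
        positions.sort()

    # Step 2: a single sweep over the surviving words.
    result = {}
    for word, postings in index.items():
        if word in stopwords:
            continue
        result[word] = {
            # Shift down by the count of removed positions below p.
            file_id: [p - bisect_left(removed[file_id], p) for p in positions]
            for file_id, positions in postings.items()
        }
    return result

# File 1 contains: rifle of the http war
index = {
    "rifle": {1: [1]},
    "of": {1: [2]},
    "the": {1: [3]},
    "http": {1: [4]},
    "war": {1: [5]},
}
new_index = remove_stopwords_one_pass(index, {"of", "http"})
```

This touches each surviving posting once, instead of re-walking the whole
index for every removed word.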

>I don't really know if I can use CVS. 


>Is the latest available archive usually at:
> ?

It is now... Jose often puts up a tarball on his site, too.  As I said, I
use CVS so I don't always think to make a distribution and put it on my web
site.

CVS is best for the most up-to-date code (it includes the newest features
and bugs!).
I've been meaning to write a cron job to do a cvs update once a day and
make a distribution, but haven't got around to it.  (Roy, if I do this can
you link to it from the main swish page?)

Bill Moseley
