Re: Stopwords when searching?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Dec 03 2001 - 14:55:52 GMT
At 11:26 PM 12/2/2001 -0800, Malcolm Box wrote:
>Does swish-e use the configured list of stopwords when searching so as 
>to filter out looking for words that don't exist in the index?

Yes.  Stopwords are NOT in the index, so it seems reasonable to not search
for them.

>From a few experiments here it would appear that it does not, which 
>leads to some interesting results.  To wit, using the supplied English 
>stopwords list for indexing, and then searching for "to" finds a few 
>documents, noticeably those with "to" in a word that is not also a 
>stopword.  This is somewhat confusing to a user of the search engine!

It's even more confusing to me ;)  Please post real examples!

The simple solution is to not use stopwords.

>I'd suggest that if a stoplist is configured then swish-e should not 
>search for words that are part of the stoplist, instead displaying a 
>message (a la Google).

Here's what swish-e provides:

> cat c
ignorewords to


> cat 1.html
That is the stopword to right there.

> ./swish-e -c c -i 1.html -T indexed_words -v 0
Indexing Data Source: "File-System"
    Adding:[swishdefault:1]   'that'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'is'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'the'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'stopword'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'right'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'there'   Pos:6  Stuct:0x1 ( FILE )
Indexing done!

> ./swish-e -w 'stopword to' -H 9
# SWISH format: 2.1-dev-24
# Search words: stopword to
#
# Index File: index.swish-e
# Removed stopword: to

Here swish just told you that the word was removed from the query.

..

# StopWords: to

That's the list of stopwords in this index file.

# BuzzWords:
# Search Words: stopword to
# Parsed Words: stopword 

That's the query swish used after adjusting for Wordchars, stopwords and other
settings that might affect the search (and indexed) words.
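
If you want to show the user a Google-style notice, one option is to read these
header lines from the search output in your front end.  The following is only a
rough sketch in Python, assuming the "# Name: value" header layout shown above
with -H 9.  The function names parse_swish_headers and stopword_notice and the
wording of the notice are made up for illustration and are not part of swish-e.

```python
def parse_swish_headers(output):
    """Collect '# Name: value' header lines from swish-e search output.

    Assumes the header layout shown in the transcript above; a header
    name may occur more than once, so values are kept in lists.
    """
    headers = {}
    for line in output.splitlines():
        if not line.startswith("#"):
            continue  # result lines, "err:" lines and the final "." are skipped
        name, sep, value = line[1:].partition(":")
        if not sep:
            continue  # a bare "#" separator line
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


def stopword_notice(headers):
    """Build a user-facing message from any 'Removed stopword' headers."""
    removed = headers.get("Removed stopword", [])
    if not removed:
        return ""
    return "Ignored common words: " + ", ".join(removed)


# Header lines copied from the transcript above.
sample = """# SWISH format: 2.1-dev-24
# Search words: stopword to
#
# Index File: index.swish-e
# Removed stopword: to
# StopWords: to
# BuzzWords:
# Search Words: stopword to
# Parsed Words: stopword
"""

headers = parse_swish_headers(sample)
print(stopword_notice(headers))
print("Query actually run:", headers["Parsed Words"][0])
```

The sample text is pasted in rather than captured from a running swish-e, so
how you obtain the output (pipe, CGI wrapper, etc.) is left to your setup.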

There's also this:
> ./swish-e -w 'to' -H 9         
# SWISH format: 2.1-dev-24
# Search words: to
#
# Index File: index.swish-e
# Removed stopword: to
err: all search words too common to be useful
.
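
To make the behaviour in the two searches above concrete, here is a simplified
model of it in Python.  This is not swish-e's code: it assumes a query is just
whitespace-separated words and ignores operators, phrases and Wordchars
handling.  The name strip_stopwords is hypothetical.

```python
def strip_stopwords(query, stopwords):
    """Split a query into the words kept for searching and the stopwords removed.

    Simplified model only: lowercases the query and splits on whitespace.
    """
    words = query.lower().split()
    removed = [w for w in words if w in stopwords]
    parsed = [w for w in words if w not in stopwords]
    return parsed, removed


stopwords = {"to"}  # matches the one-word stopword list in the config above

for query in ("stopword to", "to"):
    parsed, removed = strip_stopwords(query, stopwords)
    for word in removed:
        print("# Removed stopword:", word)
    if parsed:
        print("# Parsed Words:", " ".join(parsed))
    else:
        # Nothing left to search for, as in the second search above.
        print("err: all search words too common to be useful")
```

The point is only that a stopword is dropped from the query before the index is
consulted, and a query made up entirely of stopwords has nothing left to run.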


Bill Moseley
mailto:moseley@hank.org
Received on Mon Dec 3 14:56:40 2001