At 11:26 PM 12/2/2001 -0800, Malcolm Box wrote:
>Does swish-e use the configured list of stopwords when searching so as
>to filter out looking for words that don't exist in the index?
Yes. Stopwords are NOT in the index, so it seems reasonable to not search
for them.
>From a few experiments here it would appear that it does not, which
>leads to some interesting results. To wit, using the supplied english
>stopwords list for indexing, and then searching for "to" finds a few
>documents, noticeably those with "to" in a word that is not also a
>stopword. This is somewhat confusing to a user of the search engine!
It's even more confusing to me ;) Please post real examples!
The simple solution is to not use stopwords.
>I'd suggest that if a stoplist is configured then swish-e should not
>search for words that are part of the stoplist, instead displaying a
>message (a la Google).
Here's what swish-e provides:
> cat c
ignorewords to
> cat 1.html
That is the stopword to right there.
> ./swish-e -c c -i 1.html -T indexed_words -v 0
Indexing Data Source: "File-System"
Adding:[swishdefault:1] 'that' Pos:1 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'is' Pos:2 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'the' Pos:3 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'stopword' Pos:4 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'right' Pos:5 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'there' Pos:6 Stuct:0x1 ( FILE )
Indexing done!
> ./swish-e -w 'stopword to' -H 9
# SWISH format: 2.1-dev-24
# Search words: stopword to
#
# Index File: index.swish-e
# Removed stopword: to
Here swish just told you that the word was removed from the query.
..
# StopWords: to
That is the list of stopwords in this index file.
# BuzzWords:
# Search Words: stopword to
# Parsed Words: stopword
That is the query swish used after adjusting for Wordchars, stopwords and
other settings that might affect the search (and indexed) words.
There's also this:
> ./swish-e -w 'to' -H 9
# SWISH format: 2.1-dev-24
# Search words: to
#
# Index File: index.swish-e
# Removed stopword: to
err: all search words too common to be useful
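To make the two cases above concrete, here is a minimal Python sketch of query-time stopword removal. It is not swish-e's actual code: the function names parse_query and search_header are hypothetical, and the sketch only mimics the header lines shown in the transcripts above, assuming the query is split on whitespace and compared in lower case.

```python
def parse_query(query, stopwords):
    """Split a query into kept words and removed stopwords.

    Hypothetical helper: assumes whitespace tokenization and
    lower-case comparison, which may differ from swish-e's parser.
    """
    kept, removed = [], []
    for word in query.lower().split():
        if word in stopwords:
            removed.append(word)
        else:
            kept.append(word)
    return kept, removed


def search_header(query, stopwords):
    """Build header lines loosely modeled on the output quoted above."""
    kept, removed = parse_query(query, stopwords)
    lines = [f"# Search words: {query}"]
    # Report every stopword that was dropped from the query.
    lines += [f"# Removed stopword: {word}" for word in removed]
    if kept:
        lines.append("# Parsed Words: " + " ".join(kept))
    else:
        # Nothing left to search for once stopwords are removed.
        lines.append("err: all search words too common to be useful")
    return lines


if __name__ == "__main__":
    for q in ("stopword to", "to"):
        print("\n".join(search_header(q, {"to"})))
        print()
```

The point of the sketch is that the stopword is dropped before any index lookup happens, so a query made up only of stopwords has nothing left to search for.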
.
Bill Moseley
mailto:moseley@hank.org
Received on Mon Dec 3 14:56:40 2001