I believe that version 2.4.x *does* remove stopwords (IgnoreWords) from
queries as well as ignoring them during indexing.
I just did this test:
karman@topaz08 299% cat test.txt
hello to all the world
karman@topaz08 300% cat config
IgnoreWords File: ./ignore
DefaultContents TXT*
and then indexed:
karman@topaz08 295% swish-e -i test.txt -c config
Indexing Data Source: "File-System"
Indexing "test.txt"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 2 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
2 unique words indexed.
4 properties sorted.
1 file indexed. 23 total bytes. 2 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!
and then searched:
karman@topaz08 296% swish-e -w 'hello world'
# SWISH format: 2.4.1
# Search words: hello world
# Removed stopwords:
# Number of hits: 1
# Search time: 0.224 seconds
# Run time: 0.266 seconds
1000 test.txt "test.txt" 23
.
karman@topaz08 297% swish-e -w 'hello to all the world'
# SWISH format: 2.4.1
# Search words: hello to all the world
# Removed stopwords: to all the
# Number of hits: 1
# Search time: 0.223 seconds
# Run time: 0.264 seconds
1000 test.txt "test.txt" 23
You'll note that both queries found the document, and that the second
one reports "to all the" as removed stopwords.
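
For anyone who wants to see the idea without running swish-e, here is a
rough sketch in Python of what I think is going on. It is a toy model and
not how swish-e is actually implemented. The names STOPWORDS, build_index
and search are made up for illustration, and the stopword set is only my
assumption about what ./ignore holds, based on the "Removed stopwords"
line above. It also only covers plain word queries, not phrases.

```python
# Toy model of stopword handling; NOT swish-e's actual implementation.
# Assumed contents of ./ignore (inferred from the "Removed stopwords" line).
STOPWORDS = {"to", "all", "the"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each non-stopword to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for word in tokenize(text):
            if word not in STOPWORDS:
                index.setdefault(word, set()).add(name)
    return index


def search(index, query):
    """AND the query words together after dropping stopwords.

    Returns a tuple: (sorted list of matching documents, removed stopwords).
    """
    words = tokenize(query)
    removed = [w for w in words if w in STOPWORDS]
    kept = [w for w in words if w not in STOPWORDS]
    if not kept:
        return [], removed
    hits = set.intersection(*(index.get(w, set()) for w in kept))
    return sorted(hits), removed


index = build_index({"test.txt": "hello to all the world"})
print(sorted(index))
print(search(index, "hello world"))
print(search(index, "hello to all the world"))
```

Because the same stopword list is applied on both sides, the query only
ever has to match on words that actually made it into the index.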
Am I totally misunderstanding your question?
pek
Bill Schell wrote on 06/16/2004 11:24 AM:
> I just thoroughly confused myself by searching for a phrase ("Text of
> Report") that I knew was in the documents I had just indexed. I couldn't
> find it! After some head scratching I realized that the word 'of' is in
> the file cited in the IgnoreWords configuration directive.
>
> If this confused me, it will *really* confuse my users, who know nothing
> about any IgnoreWords file. They would have to figure out that they
> should enter "Text Report", although that is not what is in the document.
> The only immediate fix I can think of for this is to get rid of the
> IgnoreWords directive, which will make my indices bigger and slower to
> search.
>
> I'm wondering if a future version of swish-e should remove words cited in
> the IgnoreWords file from all search terms? Or is the performance loss
> from removing the IgnoreWords directive for a reasonable set of common
> English words not worth worrying about?
>
> Bill
--
Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Thu Jun 17 12:43:04 2004