RE: index size versus searching for quoted string trade

From: Andrew Payne <andrew.payne(at)not-real.calpine.com>
Date: Wed Jun 16 2004 - 20:27:04 GMT
I just had a similar problem. I've got a minimum word length of 3 characters
defined to help keep my indexes manageable, but this causes the search to
fail when one of the terms is less than 3 characters long. It might be cool
to store in the index the rules that exclude content from being indexed, and
apply those rules to the search terms before searching. I've tried using -c
to include the config file when searching (hoping that the minimum length
rule would be applied to the search terms as well), but the config file
doesn't seem to apply, or at least that part doesn't, when searching. I've
written a filter into my search page, but it's not at all portable. 
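
For illustration, here is a minimal sketch of that kind of search-page filter
in Python. It is not part of swish-e: filter_query, MIN_WORD_LEN, IGNORE_WORDS
and OPERATORS are hypothetical names, and the values are assumptions that
would have to be kept in sync by hand with whatever the index config uses.
Operator handling is simplified, so a dropped term can leave a dangling
"and"/"or" behind.

```python
import re

# Hypothetical settings; they must mirror the rules used at index time.
MIN_WORD_LEN = 3
IGNORE_WORDS = {"of", "the"}
OPERATORS = {"and", "or", "not"}  # passed through untouched (simplified)


def filter_query(query, min_len=MIN_WORD_LEN, ignore=IGNORE_WORDS):
    """Drop terms the indexer would have skipped; keep quoted phrases quoted."""

    def keep(word):
        w = word.lower()
        return len(w) >= min_len and w not in ignore

    parts = []
    # Each match is either a quoted phrase (group 1) or a bare word (group 2).
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            kept = [w for w in phrase.split() if keep(w)]
            if kept:
                parts.append('"' + " ".join(kept) + '"')
        elif word.lower() in OPERATORS or keep(word):
            parts.append(word)
    return " ".join(parts)
```

With these assumed settings, the quoted phrase "Text of Report" would be
rewritten to "Text Report" before being handed to the search.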

As a related question, what does a minimum word length do to the phrase
search capability? Does the phrase search just work by word adjacency? If
so, applying the same rules to the search terms would still allow phrase
searches to match (while adding a little ambiguity).
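
To make the ambiguity concrete, here is a toy model in Python. It assumes
that excluded words leave no gap between their neighbours, which is exactly
the open question above and may not be how swish-e stores word positions.
index_words and phrase_match are hypothetical helpers, not swish-e functions.

```python
IGNORE_WORDS = {"of"}  # hypothetical ignore list


def index_words(text, ignore=IGNORE_WORDS):
    """Return the kept words in order; ignored words leave no gap (assumption)."""
    return [w for w in text.lower().split() if w not in ignore]


def phrase_match(doc_text, phrase, ignore=IGNORE_WORDS):
    """True if the filtered phrase occurs as adjacent words in the filtered doc."""
    doc = index_words(doc_text, ignore)
    terms = index_words(phrase, ignore)
    if not terms:
        return False
    n = len(terms)
    return any(doc[i:i + n] == terms for i in range(len(doc) - n + 1))
```

Under this model, the phrase "Text of Report" matches a document containing
"Text of Report", but it also matches one that only contains "Text Report".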

-Andy

-----Original Message-----
From: Bill Schell [mailto:friedfish@optonline.net]
Sent: Wednesday, June 16, 2004 09:24
To: Multiple recipients of list
Subject: [SWISH-E] index size versus searching for quoted string
tradeoffs (possible


I just thoroughly confused myself by searching for a phrase ("Text of
Report") that I knew was in the documents I had just indexed. I couldn't
find it! After some head scratching I realized that the word 'of' is in
the file cited in the IgnoreWords configuration directive.

If this confused me, it will *really* confuse my users, who know nothing
about any IgnoreWords file. They would have to figure out that they should
enter "Text Report", although that is not what is in the document. The only
immediate fix I can think of for this is to get rid of the IgnoreWords
directive, which will make my indices bigger and slower to search.

I'm wondering if a future version of swish-e should remove words cited in
the IgnoreWords file from all search terms? Or is the performance loss from
removing the IgnoreWords directive for a reasonable set of common English
words not worth worrying about?

Bill
Received on Wed Jun 16 20:27:07 2004