Re: Phrase search

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Apr 05 2000 - 16:23:29 GMT
At 08:41 AM 04/05/00 -0700, Jose Manuel Ruiz wrote:
>This is how I have implemented it: word position is always
>incremented (if there is a stopword, it is incremented too).
>In fact, word position is also incremented when any other
>character that is neither a blank nor a new-line is found.

This is really hard.  I think you have to define words as swish does using
WordCharacters and IgnoreFirst and IgnoreLast.  You can't use new-line
because it's often html that's indexed, of course.

I do think, though, that it would be good to be able to define a few
word-ending characters (such as a period or a comma) that would bump up the
word count.  This would let people decide, for example, whether a phrase can
match across sentences, without having to explicitly type the period in
their searches.
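
A rough sketch of that idea, in Python rather than swish's C, and not taken
from the swish source.  WORD_CHARS and BUMP_CHARS are hypothetical names:
the first stands in for a WordCharacters-style setting, the second for the
proposed list of word-ending characters.  IgnoreFirst and IgnoreLast are
left out to keep it short.

```python
# Hypothetical settings, for illustration only.
WORD_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
BUMP_CHARS = ".,"


def positions(text, word_chars=WORD_CHARS, bump_chars=BUMP_CHARS):
    """Return (word, position) pairs for text.

    Each word takes the next position.  Each character in bump_chars
    skips one extra position, so the words on either side of it are
    no longer adjacent and a phrase cannot match across it.
    """
    result = []
    pos = 0
    word = []
    # The trailing space flushes the last word.
    for ch in text.lower() + " ":
        if ch in word_chars:
            word.append(ch)
            continue
        if word:
            pos += 1
            result.append(("".join(word), pos))
            word = []
        if ch in bump_chars:
            pos += 1  # leave a gap in the numbering
    return result
```

With the period in BUMP_CHARS, "Swish is fast. Search engine" puts "fast"
at position 3 and "search" at 5, so the phrase "fast search" would not
match.  With an empty bump_chars they land on 3 and 4 and it would.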

>I made a minor change here in the code, so you can 
>define the rules in swish.h (using a simple #define clause).
>So, if you define "<and>" as the rule instead of "and", and if
>"and" is not a stopword, you can find "Joe and Mary".

Good addition!

>Anyway, if you have stopwords in the index file, you cannot search
>for phrases that contain stopwords. This is, for example, how
>Verity's Search Information Server (a commercial searcher) works.
>You do need to store stopwords with their position in the index file
>for phrase search.

I'm confused ;)  Isn't a stop word by definition a word that's not in the
index?

Why do you need the stop words in the index?  Say you have the phrase

        "...Swish is a search engine...."
word:         4    - -    5      6

where "is" and "a" are stop words.

Searching for:  swish is a search engine
will throw out "is" and "a" in the search.  Without quotes around the
search phrase, the search will find any documents that contain "swish",
"search", and "engine".  But searching with quotes will require that the
words found are sequential and in the correct order.
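
To make the numbering above concrete, here is a small Python sketch of that
scheme.  It is only an illustration under my own assumptions, not how swish
stores its index: STOPWORDS, build_index, and_search and phrase_search are
made-up names, words are split on whitespace only, and positions start at 1
in each document.

```python
STOPWORDS = {"is", "a"}  # hypothetical stopword list


def build_index(docs):
    """Map word -> {doc_id: [positions]}.

    Stopwords are not stored and do not use up a position.
    """
    index = {}
    for doc_id, text in docs.items():
        pos = 0
        for word in text.lower().split():
            if word in STOPWORDS:
                continue
            pos += 1
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index


def query_words(query):
    """Lowercase the query and throw out stopwords."""
    return [w for w in query.lower().split() if w not in STOPWORDS]


def and_search(index, query):
    """Unquoted search: documents containing every non-stopword."""
    words = query_words(query)
    if not words:
        return set()
    return set.intersection(*(set(index.get(w, {})) for w in words))


def phrase_search(index, query):
    """Quoted search: the words must sit at consecutive positions."""
    words = query_words(query)
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits
```

One consequence of numbering this way: the quoted phrase "swish search
engine" matches the same document as "swish is a search engine", because
the stopwords leave no gap in the positions.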

I'm really curious to see how big the index becomes with all the word
positions stored.


Bill Moseley
mailto:moseley@hank.org
Received on Wed Apr 5 12:28:34 2000