Re: Phrase search

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Wed Apr 05 2000 - 15:42:04 GMT
SRE,

More about "Joe and Mary"...

This is how I have implemented it: the word position is always
incremented (a stopword increments it too).
In fact, the word position is also incremented whenever any other
character that is neither a blank nor a new-line is found.

So if Joe is in position n, Mary is in position n+2. The same
applies to "Joe, Mary".
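
A minimal sketch of this counting rule, not the actual swish code. The
function name word_position is hypothetical, a "word" is assumed to be a
run of alphanumeric characters, and tabs and carriage returns are assumed
to count as blanks:

```c
#include <ctype.h>
#include <string.h>

/* Return the position given to the first occurrence of `word` in `text`,
 * or -1 if the word does not occur. Matching is case-sensitive.
 * Hypothetical helper, written only to illustrate the counting rule. */
int word_position(const char *text, const char *word)
{
    int pos = 0;
    size_t wlen = strlen(word);
    const char *p = text;

    while (*p != '\0') {
        if (isalnum((unsigned char)*p)) {
            /* Scan to the end of the current word. */
            const char *start = p;
            while (isalnum((unsigned char)*p))
                p++;
            if ((size_t)(p - start) == wlen && strncmp(start, word, wlen) == 0)
                return pos;
            pos++;      /* every word, stopword or not, takes a position */
        } else {
            /* Blanks and new-lines are skipped without counting. */
            if (*p != ' ' && *p != '\t' && *p != '\n' && *p != '\r')
                pos++;  /* any other character also takes a position */
            p++;
        }
    }
    return -1;
}
```

With this sketch, word_position("Joe and Mary", "Mary") and
word_position("Joe, Mary", "Mary") both return 2, while Joe is at 0.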

So if you search for "Joe and Mary", you cannot find it if...
1- "and" is a stopword. You simply cannot find "and".
2- "and" is a reserved word for a rule. If "and" is a rule, the
isrule function in search.c returns true and it is not treated
as a word. I made a minor change to the code here, so you can
define the rules in swish.h (using a simple #define).
So, if you define "<and>" as the rule instead of "and", and if
"and" is not a stopword, you can find "Joe and Mary".
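
A rough sketch of the idea, not the real contents of swish.h or search.c.
The macro names AND_RULE, OR_RULE and NOT_RULE are assumptions, and the
body of isrule here is only an illustration of a check driven by #define:

```c
#include <string.h>

/* Hypothetical macro names; in the scheme described above they would be
 * defined in swish.h so the operators can be changed in one place. */
#define AND_RULE "<and>"
#define OR_RULE  "<or>"
#define NOT_RULE "<not>"

/* Return 1 if `word` is a search operator, 0 if it is an ordinary word.
 * Illustrative stand-in for the isrule check mentioned above. */
int isrule(const char *word)
{
    return strcmp(word, AND_RULE) == 0 ||
           strcmp(word, OR_RULE) == 0 ||
           strcmp(word, NOT_RULE) == 0;
}
```

With these definitions, isrule("<and>") returns 1 and isrule("and")
returns 0, so a plain "and" can be treated as a word in a phrase.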

Anyway, if the index file has stopwords, you cannot search for
phrases that contain them. This is, for example, how
Verity's Search Information Server (a commercial searcher) works.
To search for such phrases, the stopwords would have to be stored
in the index file along with their positions.
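
To illustrate why the positions matter, here is a simplified sketch of a
phrase match over stored positions. It is not the swish index format: the
document is modelled as a plain array indexed by word position, with NULL
where nothing was stored, and find_phrase is a hypothetical name:

```c
#include <stddef.h>
#include <string.h>

/* doc[i] is the word stored at position i, or NULL if nothing was stored
 * there (for example a dropped stopword or a punctuation character).
 * Return the first position where `phrase` matches word for word,
 * or -1 if there is no match. */
int find_phrase(const char *const *doc, size_t doc_len,
                const char *const *phrase, size_t phrase_len)
{
    if (phrase_len == 0 || phrase_len > doc_len)
        return -1;

    for (size_t start = 0; start + phrase_len <= doc_len; start++) {
        size_t i = 0;
        /* Each phrase word must sit at consecutive positions. */
        while (i < phrase_len &&
               doc[start + i] != NULL &&
               strcmp(doc[start + i], phrase[i]) == 0)
            i++;
        if (i == phrase_len)
            return (int)start;
    }
    return -1;
}
```

If "and" is stored at position 1, the phrase {"Joe", "and", "Mary"}
matches at position 0. If that slot is NULL because the stopword was
dropped, the same phrase returns -1.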

I think I will make the code available in a couple of days. Then,
I am sure that more problems will appear.

Because of my work, I have been busy with other things
and have not been able to do much testing.

Have a nice day

SRE wrote:
> 
> At 12:42 AM 4/4/00 -0700, Jose Manuel Ruiz wrote:
> >2- Searching for 'Joe and Mary' is not possible because "and"
> >is a reserved word. I think using "<and>", "<or>", "<not>"
> >will make things easier, but this is a major change for all
> >the CGI programs working to date.
> 
> Good CGI scripts will check the swish version and adapt
> (or refuse to run if they don't know about the current version).
> I love forward compatibility, but in this case you either need
> a way to bypass the stop word OR a way to index it anyway.
> 
> Option 1: parse the search phrase, find out if it includes
> a stop word, and match if ANY word is where the search phrase
> had a stop word. For instance, "Joe and Mary" would match
> "Joe kissed Mary" but would NOT match "Joe slowly kissed Mary".
> This could be done strictly with word positions, where you
> ignore the word and increment the word counter if the phrase
> contains a stop word. Of course, matching "Joe and not Mary"
> would have to count two stop word skips, etc.
> 
> Option 2: don't have any stop words if you are indexing for
> phrase matches. I think this is unworkable, but it's an option.
> 
> >I think it will not be difficult to add a "near" operator.
> >I mean, searching for a word which is at most n positions away
> >from another. Would that be interesting?
> 
> Absolutely! Especially since the CGI script could define
> what "near" means in terms of max-number-of-words-between.
> 
> SRE
> 
> mailto:eckert(at)not-real.climber.org | http://www.climber.org/eckert/
> Info on peak climbing email lists mailto:info@climber.org
> 
> I just forgot my whole philosophy of life...
> Someone tell me what to put here, please!

-- 

Jose Manuel Ruiz Ramos

jmruiz@boe.es

Jefe de Area Informatica
Boletin Oficial del Estado
Manoteras 54
Madrid 28050
Spain
Received on Wed Apr 5 11:44:06 2000