Alpha version Phrase Search

From: Jose Manuel Ruiz <jmruiz(at)>
Date: Mon Apr 10 2000 - 15:42:40 GMT
Well, at last the alpha version!!

You can find the code in

Here is the README-PHRASE.

This code is based on swish-1.3.2-SRE, but I have made lots
of changes to it:

1- Modifications to index.c, search.c and merge.c to get better
performance. These include:

1a - A new organization of offsets (longs), which are now stored in just
"portable" bytes to save space.

1b - A hash approach has been added to speed up searches. This reduces disk
access. Now you can search for things like "a* or b* or c* or d* or e* ..."
without the penalty of reading the linked list for each word of the expanded
wildcard. Using 4 bytes for longs, not much space is required to do it (just 4
bytes per word) and a little more space to handle the initial hash table.

1c - Internally, I have changed how results are organized. It now uses an
array approach instead of a binary tree, which lets us use qsort and bsearch.
Merge is faster and "not" results are also faster.
You can now search for things like "not axdf", which will probably return
all the documents, without a performance penalty.

1d - Added some calls to free() to save space when computing results. This
is useful when long lists of results are computed (e.g. a* or b* or c* ...).
1e - Increased the size of the hash table for "or" results to achieve better
performance (changed hash to bighash).
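
As a rough illustration of the array approach in 1c, here is a minimal sketch. It assumes results are kept as a plain array of file numbers running from 1 to the total number of files. The names sort_results, has_result and not_results are invented for this example and are not taken from the swish-e sources.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback shared by qsort and bsearch. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort a result list (file numbers) in ascending order. */
void sort_results(int *files, size_t n)
{
    qsort(files, n, sizeof files[0], cmp_int);
}

/* Return 1 if filenum is in the sorted result list, 0 otherwise. */
int has_result(const int *files, size_t n, int filenum)
{
    return bsearch(&filenum, files, n, sizeof files[0], cmp_int) != NULL;
}

/* "not": write every file number in 1..total_files that is absent from
   the sorted list `files` into `out`, and return how many were written.
   `out` must have room for total_files entries. */
size_t not_results(const int *files, size_t n, int total_files, int *out)
{
    size_t i = 0, k = 0;
    int f;

    assert(out != NULL);
    for (f = 1; f <= total_files; f++) {
        while (i < n && files[i] < f)
            i++;                    /* skip entries below f */
        if (i < n && files[i] == f)
            continue;               /* f matched the word: excluded */
        out[k++] = f;
    }
    return k;
}
```

With a sorted array, "not" becomes a single linear pass over the file numbers, and membership tests cost one bsearch call instead of a tree walk.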

2- More information has been added to the index:

2a - Frequency of each word in the file

2b - Relative positions of each word in the file. This is especially useful
for phrase search.

2c - Due to the modifications mentioned in 1, an extra 4 bytes per word are
stored for maintaining the hash linked list. Also, an initial array of
longs (4 bytes per long) is stored for the hash table.
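
To make the layout in 1a, 1b and 2c more concrete, here is a small sketch that simulates the index file with a byte buffer. Everything specific in it is an assumption made for illustration: the table size, the byte order (most significant byte first), the record layout and the function names. The real index format may differ.

```c
#include <assert.h>
#include <string.h>

#define HASH_SIZE   101               /* assumed number of buckets */
#define TABLE_BYTES (HASH_SIZE * 4)   /* initial array of 4-byte longs */

/* Store a long as 4 "portable" bytes, most significant byte first
   (assumed order), so it reads back the same on any machine. */
void put_long(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xFF);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}

unsigned long get_long(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}

/* Simple illustrative hash, not the one swish-e uses. */
unsigned long hash_word(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = (h * 31 + (unsigned char)*s++) % HASH_SIZE;
    return h;
}

/* Append a word record at offset *end and link it at the head of its
   bucket chain. Assumed record layout: 4-byte next offset, 1-byte
   length, word bytes. Offset 0 means "no record", which is safe because
   records always start after the bucket table. The caller must zero the
   table, start *end at TABLE_BYTES and provide a large enough buffer. */
unsigned long add_word(unsigned char *buf, unsigned long *end, const char *word)
{
    size_t len = strlen(word);
    unsigned char *slot = buf + 4 * hash_word(word);
    unsigned long rec = *end;

    assert(len > 0 && len < 256);
    put_long(buf + rec, get_long(slot));   /* next = old chain head */
    buf[rec + 4] = (unsigned char)len;
    memcpy(buf + rec + 5, word, len);
    put_long(slot, rec);                   /* bucket now points here */
    *end = rec + 5 + (unsigned long)len;
    return rec;
}

/* Follow only the chain of the word's own bucket. Returns the record
   offset, or 0 if the word is not in the index. */
unsigned long find_word(const unsigned char *buf, const char *word)
{
    size_t len = strlen(word);
    unsigned long rec = get_long(buf + 4 * hash_word(word));

    while (rec != 0) {
        if ((size_t)buf[rec + 4] == len &&
            memcmp(buf + rec + 5, word, len) == 0)
            return rec;
        rec = get_long(buf + rec);         /* the extra 4 bytes per word */
    }
    return 0;
}
```

The space cost matches what is described above: one 4-byte "next" field per word plus HASH_SIZE 4-byte entries for the initial table.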

3 - Phrase search:

3a - Implementation of phrase search (thanks to Eckert-SRE and Bill
for their advice). This is how it works:

First of all, you need a phrase delimiter character. You can use whatever
you want, but avoid characters that do not fit well with UNIX shells.
Of course, you must use a character that is not allowed in
a word for indexing. Let us say that our character is "_". So, a search
for "Berkeley University" will look like _Berkeley University_.
In a metaname called location it would be: location=_Berkeley University_.
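
For readers wondering how stored word positions can answer a phrase query such as _Berkeley University_, here is a minimal sketch. It assumes each word of the phrase has a sorted array of its positions within one file. The function name phrase_match and its signature are hypothetical and do not come from search.c.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* pos[k] is the sorted list of positions of the k-th phrase word in one
   file, and count[k] is its length. The phrase matches if some position
   p of the first word is followed by word k at position p + k, for every
   k from 1 to nwords - 1. Returns 1 on a match, 0 otherwise. */
int phrase_match(const int *const *pos, const size_t *count, size_t nwords)
{
    size_t i, k;

    assert(nwords > 0);
    for (i = 0; i < count[0]; i++) {
        for (k = 1; k < nwords; k++) {
            int want = pos[0][i] + (int)k;
            if (bsearch(&want, pos[k], count[k], sizeof(int), cmp_int) == NULL)
                break;              /* word k is not at the right place */
        }
        if (k == nwords)
            return 1;               /* every word lined up */
    }
    return 0;
}
```

For example, if "Berkeley" occurs at positions 4 and 20 and "University" at 9 and 21, the pair (20, 21) makes the phrase match.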

To change this char modify the following two lines in swish.h

What happens if we want to find exactly "Joe and Mary"? "and" is an
operator word. If this is important to you, change
#define AND_WORD "and"
in swish.h to whatever you like. This can also be useful when using
languages other than English. For example, in Spanish it would be:
#define AND_WORD "y".

Of course, any change made to swish.h requires swish-e to be rebuilt.

Not tested yet:

- Stemming. Can anybody do it? I think stemming is at the same point
it was before I made the changes, but...
- Merge. Although it seems to work, I made a lot of changes on it to
- Stopwords and phrase search. Let me know your opinions on this point.
Should stopwords be included in the word counter or not? Look at phrase

To do:

- Better compression for integer long values. Now that word positions are
stored in the index file, it is bigger. gzip? Any ideas?
- A new operator: "near". Word1 is within n positions of word2.
- External filters? I think some work has already been done. Let me know.
- Better dump (option -D). Now, there is more info to show.
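
One possible reading of the proposed "near" operator, sketched under the assumption that it means "some occurrence of word1 lies within n positions of some occurrence of word2, in either direction". The function words_near is hypothetical, since nothing like it exists yet.

```c
#include <assert.h>
#include <stddef.h>

/* a[] and b[] are the sorted positions of word1 and word2 in one file.
   Returns 1 if any pair of positions differs by at most n, 0 otherwise.
   Two indexes walk both lists once, always advancing the smaller value. */
int words_near(const int *a, size_t na, const int *b, size_t nb, int n)
{
    size_t i = 0, j = 0;

    assert(n >= 0);
    while (i < na && j < nb) {
        int diff = a[i] > b[j] ? a[i] - b[j] : b[j] - a[i];
        if (diff <= n)
            return 1;
        if (a[i] < b[j])
            i++;
        else
            j++;
    }
    return 0;
}
```

With n = 1 and an added check on word order, this reduces to the adjacency test that phrase search needs.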

Finally, keep in mind

1- This is very much an alpha version. Use it for testing purposes and, if
it is useful to you, give me some feedback on the swish-e discussion list.

2- It is not compatible with earlier versions. This means that you MUST
reindex your data.

By the way, this is FREE code; see the swish-e licence for more info.

Jose Manuel Ruiz Ramos
Received on Mon Apr 10 11:44:30 2000