Alpha version Phrase Search

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Mon Apr 10 2000 - 15:42:40 GMT
Well, at last the alpha version!!

You can find the code at http://www.boe.es/swish-e

Here is the README-PHRASE.

This code is based on swish-1.3.2-SRE, but I have made lots
of changes to it:

1- Modifications to index.c, search.c and merge.c to get better
performance. This includes:

1a - A new organization of offsets (longs), which are now stored in just
4 "portable" bytes to save space.
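
As an illustration only (this is not the code from index.c), storing a
long in 4 portable bytes could look like the sketch below. The helper
names pack_long4 and unpack_long4 are hypothetical, and the byte order
(most significant byte first) is an assumption; the point is just that
the on-disk layout does not depend on the machine's own sizeof(long) or
endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not taken from the swish-e source. */

/* Store the low 32 bits of value in buf, most significant byte first
   (the byte order is an assumption made for this sketch). */
void pack_long4(uint32_t value, unsigned char *buf)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)((value >> 24) & 0xFF);
    buf[1] = (unsigned char)((value >> 16) & 0xFF);
    buf[2] = (unsigned char)((value >> 8) & 0xFF);
    buf[3] = (unsigned char)(value & 0xFF);
}

/* Rebuild the value from the 4 bytes written by pack_long4(). */
uint32_t unpack_long4(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

With this layout an offset can be no larger than 2^32 - 1, so the index
file is limited to about 4 GB.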

1b - A hash approach has been added to speed up searches. This reduces
disk I/O. Now you can search for things like
"a* or b* or c* or d* or e* ..." without the penalty of reading the
linked list for each word of the expanded list. With 4-byte longs, not
much space is required to do it (just 4 bytes per word), plus a little
more space to hold the initial hash table.

1c - Internally, I have changed how results are organized. They now use
an array approach instead of a binary tree, which lets me use qsort and
bsearch. Merge is now faster, and "not" results are also faster. You can
now search for things like "not axdf", which will probably return all
the documents, without a performance penalty.
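
To give an idea of why an array helps with "not", here is a minimal
sketch (not the actual search.c code). It assumes results are plain
file numbers from 1 to total_files; the function name not_results and
that numbering are hypothetical. The hits are sorted once with qsort,
and each candidate file is then checked with bsearch.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way comparison of two ints, for qsort() and bsearch(). */
int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical sketch: sorts hits in place, then writes to out every
   file number in 1..total_files that is NOT in hits. Returns how many
   numbers were written. out needs room for total_files entries. */
size_t not_results(int *hits, size_t nhits, int total_files, int *out)
{
    size_t n = 0;
    int f;

    assert(hits != NULL || nhits == 0);
    assert(out != NULL);

    if (nhits > 0)
        qsort(hits, nhits, sizeof hits[0], cmp_int);

    for (f = 1; f <= total_files; f++) {
        if (nhits == 0 ||
            bsearch(&f, hits, nhits, sizeof hits[0], cmp_int) == NULL)
            out[n++] = f;
    }
    return n;
}
```

For a word such as "axdf" that matches almost nothing, the hits array
is tiny, so each bsearch call is cheap even though nearly every file
ends up in the output.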

1d - Added some calls to free() to save space when computing results.
This is useful when long lists of results are computed (e.g. a* or b*
or c* ...).

1e - Increased the size of the hash table for "or" results to achieve
better performance (changed hash to bighash).

2- More information has been added to the index:

2a - Frequency of each word in the file

2b - Relative positions of each word in the file. This is especially
useful for phrase search.
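
A rough sketch of how stored positions make a two-word phrase check
possible (this is an illustration, not the code in search.c). It
assumes each word has an ascending list of positions within one file,
and that the second word of a phrase must sit exactly one position
after the first. The function name adjacent_in_file is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: returns 1 if some position p in first[] has
   p + 1 present in second[], else 0. Both arrays must be sorted in
   ascending order. */
int adjacent_in_file(const int *first, size_t nfirst,
                     const int *second, size_t nsecond)
{
    size_t i = 0, j = 0;

    assert(first != NULL || nfirst == 0);
    assert(second != NULL || nsecond == 0);

    while (i < nfirst && j < nsecond) {
        int want = first[i] + 1;    /* where the second word must be */
        if (second[j] == want)
            return 1;
        if (second[j] < want)
            j++;                    /* second word is too early */
        else
            i++;                    /* try the next first-word position */
    }
    return 0;
}
```

Because both lists are walked once, the check costs at most
nfirst + nsecond steps per file.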

2c - Due to the modifications mentioned in 1, an extra 4 bytes per word
are stored to maintain the hash linked list. An initial array of longs
(4 bytes per long) is also stored for the hash table.

3 - Phrase search:

3a - Implementation of phrase search (thanks to Eckert-SRE and Bill
Moseley for their advice). This is how it works:

First of all, you need a phrase delimiter character. You can use
whatever you want, but avoid characters that do not play well with UNIX
shells (' and "). Of course, you must use a character that is not
allowed in a word for indexing. Let us say that our character is "_".
So, searching for "Berkeley University" will look like
_Berkeley University_. In a metaname called location it would be:
location=_Berkeley University_.

To change this character, modify the following two lines in swish.h:
#define PHRASE_DELIMITER_CHAR '_'
#define PHRASE_DELIMITER_STRING "_"
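
To make the delimiter idea concrete, here is a small sketch (not the
parser used in search.c) that pulls the phrase out of a query string.
Only PHRASE_DELIMITER_CHAR comes from swish.h; the function
extract_phrase and its behaviour on errors are assumptions made for
this example.

```c
#include <assert.h>
#include <string.h>

#define PHRASE_DELIMITER_CHAR '_'

/* Hypothetical sketch: copies the text between the first pair of
   delimiters in query into out. Returns 1 on success, or 0 if there
   is no complete phrase or out is too small. */
int extract_phrase(const char *query, char *out, size_t outsize)
{
    const char *start, *end;
    size_t len;

    assert(query != NULL && out != NULL);

    start = strchr(query, PHRASE_DELIMITER_CHAR);
    if (start == NULL)
        return 0;
    start++;                        /* skip the opening delimiter */

    end = strchr(start, PHRASE_DELIMITER_CHAR);
    if (end == NULL)
        return 0;                   /* no closing delimiter */

    len = (size_t)(end - start);
    if (len + 1 > outsize)
        return 0;                   /* no room for text plus '\0' */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Called with "location=_Berkeley University_", it leaves
"Berkeley University" in out. This also shows why the delimiter must
not be a character allowed inside indexed words: any "_" within a word
would be taken as the end of the phrase.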

What happens if we want to find exactly "Joe and Mary"? "and" is a
reserved word. If this is important to you, change
#define AND_WORD "and"
in swish.h to whatever you like. This can also be useful when using
languages other than English. For example, in Spanish it would be:
#define AND_WORD "y"

Of course, any change made to swish.h requires swish-e to be rebuilt.

Not tested yet:

- Stemming. Can anybody test it? I think stemming is at the same point
it was before I made the changes, but...
- Merge. Although it seems to work, I made a lot of changes to it to
boost performance.
- Stopwords and phrase search. Let me know your opinions on this point.
Should stopwords be included in the word counter or not? See the phrase
search discussion.

To do:

- Better compression for long integer values. Now that word positions
are stored in the index file, it is bigger. gzip? Any ideas?
- A new operator: "near". Word1 is within n positions of word2.
- External filters? I think some work has already been done. Let me know.
- Better dump (option -D). There is now more info to show.

Finally, keep in mind

1- It is a very alpha version. Use it for testing purposes and, if it is
useful to you, give me some feedback on the swish-e discussion list.

2- It is not compatible with earlier versions. This means that you MUST
reindex your data.

By the way, this is FREE code; see the swish-e licence for more info.

Jose Manuel Ruiz Ramos

jmruiz@boe.es
Spain
Received on Mon Apr 10 11:44:30 2000