RE: Word Locations

From: <jmruiz(at)not-real.boe.es>
Date: Thu Mar 01 2001 - 10:14:59 GMT
Hi Scott,

On 28 Feb 2001, at 16:22, Scott Schultz wrote:

> Okay, I admit it. Trying to understand the Swish-E
> source code makes my head swim.
> 

I am really sorry. It is probably my fault, because I began working on 
swish to adapt the old 1.3.2 code to my own needs and to make the 
indexing process much faster. Swish 1.3.X is terribly slow at both 
indexing and searching large collections of data.
Rainer Scherg, Bill Moseley and I are currently working on it, and 
there are many new features. The latest version is always in CVS 
at www.sourceforge.net. I hope to release 2.2 soon.

> Does the location structure used to store the location
> of individual words? Is this where the "position" 
> variable in the results list elements comes from? 
> 

Yep, this is the structure you are looking for. For each entry there is 
an array of structures of this type.
typedef struct {
        int metaID;
        int filenum;
        int structure;
        int frequency;    /* Number of occurrences of the word
                             in the file */
        int position[1];  /* The positions: array indexed from
                             0 to frequency-1 */
} LOCATION;

BTW, position contains the relative positions of words (word 1, 2, 3...), 
not offsets inside the file. In fact, each field/metaname has its own 
counter.
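
To make the layout concrete, here is a minimal sketch of how a structure like this can be filled and scanned. The names location_new and location_has_position are hypothetical helpers written for this example; they are not functions from the swish-e source. The sketch assumes the usual trick of allocating extra space past position[1] to hold all the positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same layout as the LOCATION structure quoted above. */
typedef struct {
    int metaID;
    int filenum;
    int structure;
    int frequency;    /* number of occurrences of the word in the file */
    int position[1];  /* word positions, indexed 0 .. frequency-1 */
} LOCATION;

/* Hypothetical helper (not part of swish-e): allocate a LOCATION with
 * room for `frequency` positions and copy them in. The caller frees
 * the result with free(). */
static LOCATION *location_new(int metaID, int filenum, int structure,
                              const int *positions, int frequency)
{
    LOCATION *loc;
    int i;

    assert(frequency >= 1);
    /* position[1] already holds one slot; add space for the rest. */
    loc = malloc(sizeof(LOCATION) + (size_t)(frequency - 1) * sizeof(int));
    if (loc == NULL)
        return NULL;

    loc->metaID = metaID;
    loc->filenum = filenum;
    loc->structure = structure;
    loc->frequency = frequency;
    for (i = 0; i < frequency; i++)
        loc->position[i] = positions[i];
    return loc;
}

/* Hypothetical helper: return 1 if the word occurs at word position
 * `pos` in this file/metaname, 0 otherwise. */
static int location_has_position(const LOCATION *loc, int pos)
{
    int i;

    for (i = 0; i < loc->frequency; i++)
        if (loc->position[i] == pos)
            return 1;
    return 0;
}
```

For example, a word that appears as word 3, 17 and 42 of a file would be stored with frequency 3 and those three values in position[0..2].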

> In other words, is it possible to add some code to 
> swish-e that will return the offsets of the words that were
> successfully matched? This could be used by the wrapper
> scripts to do keyword hilighting. 
> 

Well, we have discussed this issue before. The feature you are 
proposing is only possible for plain files and HTML pages; it is 
useless for externally filtered documents, e.g. PDF or database 
output (MySQL, Oracle...).

Another important thing to consider is that stopwords are 
removed (if you have stopwords, of course), so the positions may 
not match exactly the words to highlight.

Keep in mind that, to find the word, your CGI script has to split 
the documents in the same way the indexer did, using the rules 
(WordCharacters, etc.) that you specified at index time. Do not 
forget stemming and/or soundex.
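
As a rough illustration of what the wrapper would have to do, here is a sketch that re-splits a document and converts a word position into a byte offset. The function find_word is hypothetical, not swish-e code. It assumes a word is simply a run of characters taken from a WordCharacters-like set, that positions start at 1, and that no stopwords were removed and no stemming applied; with stopwords in play the counting would have to follow whatever the indexer actually did.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of swish-e): locate the `wanted`-th
 * word (1-based) in `text`, where a word is a maximal run of bytes
 * that all appear in `wordchars`. On success, store the byte offset
 * and length of the word and return 1; return 0 if the text has
 * fewer than `wanted` words. */
static int find_word(const char *text, const char *wordchars,
                     int wanted, size_t *start, size_t *len)
{
    size_t i = 0;
    int count = 0;

    while (text[i] != '\0') {
        size_t begin;

        /* Skip everything that is not a word character. */
        while (text[i] != '\0' && strchr(wordchars, text[i]) == NULL)
            i++;
        if (text[i] == '\0')
            break;

        /* Consume one word. */
        begin = i;
        while (text[i] != '\0' && strchr(wordchars, text[i]) != NULL)
            i++;

        count++;
        if (count == wanted) {
            *start = begin;
            *len = i - begin;
            return 1;
        }
    }
    return 0;
}
```

A highlighting script could call this for each position returned for a hit and wrap the bytes text[start .. start+len-1] in markup. It only lines up with the index if the splitting rules are identical on both sides.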

As you can see, this is not very easy. 

Anyway, every new idea and help is always welcome.

cu
Jose
Received on Thu Mar 1 10:24:00 2001