Skip to main content.
home | support | download

Back to List Archive

Ranking algorithm is flawed when more than two search terms...

From: Mark Gaulin <gaulin(at)not-real.designinfo.com>
Date: Wed Aug 12 1998 - 15:56:46 GMT
Hi
Ranking has got to be a bit of a black art, so I am not about to criticize the
exact ranking weights, etc, but there does seem to be a basic flaw in the way 
swish-e 1.1 combines the ranks when doing searches of three or more search
words.

Below is the "proof", but I'll start with the conclusion:
When searching for three words (using AND), the final rank of a file will be 
rank(word1) * 25% + rank(word2) * 25% + rank(word3) * 50%
and this is counter intuitive.
The desired ranking (in my opinion) is to weight each of the word ranks at
33%,
and that is what I plan on implementing.  Any other suggestions?

Proof: Every word in a file has a rank (call it "rank(file, word)"). 
When searching for two words the steps are to find all of the files that
contain 
word1 (store in list1), then find the files that contain word2 (store in
list2), and 
then do the AND. To do the AND you find all of the files that are in both
list1 
and list2 and store them in list1&2. Simple.
The trick is computing a new "rank" value when you take a given file from
list1
(where the file is ranked using "word1") and list2 (where the same file is
ranked
using "word2") and place it in list1&2. The current algorithm averages to
two ranks.
So, the rank of "file1" would look like this:
rank(file1, "word1 AND word2") = 
	rank(file1, "word1") * 50% + rank(file1, "word2") * 50%
That feels ok, intuitively... we weight each of the words "the same".

Trouble: When there are three words (word1, word2, word3) the algorithm will
do all of the above steps with list1 and list2, and it will create list1&2
exactly
the same way, yielding the exact same rank for file1. 
Then we produce list3, which has all files that contain word3 and use the same
ANDing technique above to create list(1&2)&3. And then we rank using
the averaging technique:
rank(file1, "word1+word2 AND word3") =
	rank(file1, "word1+word2") * 50% + rank(file1, "word3") * 50%
but rank(file1, "word1+word2") = 
	the previously computed rank  =
	rank(file1, "word1") * 50% + rank(file1, "word2") * 50%
so the final ranking for file1 is
rank(file1, "word1+word2 AND work3") =
	(rank(file1, "word1") * 50% + rank(file1, "word2") * 50% )* 50% + 
	rank(file1, "word3") * 50%
and so we are weighing the rank of word3 at 50% while word1 and word2 are
only weighed at 25%.... the last search term is always weighed at 50% and
the other terms are always spread out over the remaining 50%.

	Mark
Received on Wed Aug 12 09:06:18 1998