At 01:37 PM 04/30/01, David Wood wrote:
>Would you mind posting the actual code changes you made to index.c? I'd
>like to give it a try as well.
The piece of code now looks like:
    /* Note #1: this test is now commented out
    if (freq < 5)
        freq = 5;
    */
    d = 1.0 / (double) tfreq;
    e = (log((double) freq) + 10.0) * d;
    if (ignoreTotalWordCountWhenRanking)    /* Note #2 */
    {
        /* scale the rank down a bit. a larger divisor has the effect of
           making small differences in word frequency wash out */
        e /= 100;
    }
    else
    {
        e /= words;
    }
    f = e * 10000.0 * 100.0;    /* Note #3 */
I didn't add any comments to the code (the "Note #" markers are only for
this message); this is just in my private version at the moment. The change
addresses three issues:
1) Ranking was computed the same whether a word occurred in a file 1 time
or 5 times.
2) The sense of ignoreTotalWordCountWhenRanking is backwards.
3) The computed rank, when the word is found in more than 100 or so
files, ends up as such a small integer that all differences wash out.
Comments about the above:
1) I don't know why match counts below 5 were raised to 5. I don't know
whether removing that floor creates any strange problems in the ranking, but
in my limited testing it improved the distribution of ranks. In my opinion,
a document with 2 hits should be ranked below a document with 5 hits! The
above change makes that happen.
2) This is clearly the right fix, except that it will break everyone's
existing setup, since the sense of the setting is reversed. I found that
dividing by the number of words weights the ranking so heavily towards
smaller documents (even when others have title matches, or more word
matches) that it gave a very inaccurate rank.
3) This is only a temporary workaround. Note that the large ranks that can
come out of this routine are OK, because in the end the highest rank is
normalized to 1000 anyway. This is just to prevent ranks like 1.34231 and
1.53422 from both turning into 1.
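The truncation problem in point 3 is easy to reproduce in isolation. The sketch below is not swish code: truncate_rank, show_truncation and the scale parameter are hypothetical names, and it assumes only that the floating-point value ends up cast to an int, as the 1.34231 / 1.53422 example implies.

```c
#include <stdio.h>

/* Hypothetical helper, not part of swish: scale a floating-point
   rank value and truncate it to an int, the way a C cast does. */
int truncate_rank(double e, double scale)
{
    return (int) (e * scale);
}

/* Print the two values from the example above, first without and
   then with the extra factor of 100. */
void show_truncation(void)
{
    double a = 1.34231;
    double b = 1.53422;

    printf("scale 1:   %d %d\n", truncate_rank(a, 1.0), truncate_rank(b, 1.0));
    printf("scale 100: %d %d\n", truncate_rank(a, 100.0), truncate_rank(b, 100.0));
}
```

Calling show_truncation() from a main() shows both values collapsing to the same integer at scale 1 and staying distinct at scale 100, which is all the extra multiplier is meant to achieve.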
I personally found a positive effect from each of these changes, and all of
the changes taken together gave far superior results, especially when
looking for multiple words and/or words that appear in your files somewhat
frequently.
My only theory about "how could this ever have worked" was that people must
have primarily been doing searches for words that only appeared in a few
files. In this case, swish would find them all, but their ranks relative to
each other would be nearly irrelevant...
Some of us also feel that the ranking for a hit in titles and the header is
too high. Currently it boosts the overall rank by a factor of 5. But for
now, I didn't try changing that.
I'd be curious if others try this and see how their searching and ranking
improve. Be sure that you DO use "IgnoreTotalWordCountWhenRanking yes" !!!
(which you probably had before, but now it does what you wanted ;-)
Try -H 9 -- you can see the raw ranking numbers this way. Try it before and
after this change!
Bill
P.S. If you try this, and think it helps your searches, let us know!
Received on Mon Apr 30 18:11:46 2001