Hi Bill,
Would you mind posting the actual code changes you made to index.c? I'd
like to give it a try as well.
Did you just change:
f = e * 10000.0;
to
f = e * 1000000.0;
and
if (ignoreTotalWordCountWhenRanking)
to
if (!ignoreTotalWordCountWhenRanking)
?
thanks,
David
At 00:21 30-04-01, Bill Meier wrote:
>At 07:42 PM 04/28/01, Bill Moseley wrote:
>>Two points. You are searching your site, so I'd expect that searching for
>>"insulators" might turn up a lot of hits for "insulators". ...
>>
>>Second, phrase searches (which means word positions) are new to swish. So
>>word proximity hasn't been taken into account (as far as I know) ...
>>[correct, they aren't yet]
>>
>>All that being said, if you have the time, take a look at the code and come
>>up with some good algorithms. Open source projects can always use some
>>good help, and there's no doubt at all that the ranking can be vastly
>>improved with swish.
>
>I looked at the ranking code, did some debugging, and found a few serious
>problems...
>
>Basically, it is a simple loss of precision. In index.c, the function
>getrank computes the rank as a double and then converts it to an integer
>after scaling it by 10,000. If you are searching for words that occur
>often, say on 100 or more pages (the page count is factored into the
>rank), the floating-point computation yields very small numbers that,
>when multiplied by 10,000, give integers in the single-digit range.
>You have just lost all your precision...
>
>In my example of "Insulator Shows", insulator occurred in 887 files with
>a rank of around 1.0 - 1.5, and shows occurred in 350 files with a rank
>from 2.0 - 3.0. So what you got back was little more than a list of
>files that contained the words "insulator" and "shows" in some arbitrary order.
>
>This is not a complete fix, but I simply increased the multiplier by a
>factor of 100, so now rather than turning 1.51634 into a rank of 1, it
>turns it into 151... And that ALONE made an incredible difference!
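
A minimal sketch of the truncation described above, for anyone who wants to see it in isolation. The helper name scale_rank is hypothetical and is not taken from index.c; only the two multipliers and the value 1.51634 come from this thread, and the second raw score used in the note below is purely illustrative.

```c
/* Hypothetical helper, not the real getrank: scale a floating-point
 * rank and convert it to an integer. The conversion discards the
 * fractional part, which is where the precision goes. */
int scale_rank(double e, double multiplier)
{
    double f = e * multiplier;
    return (int)f;
}
```

With the old multiplier, raw scores of 0.000151634 and 0.0001195 both come out as 1, so the two files cannot be told apart. With 1000000.0 they come out as 151 and 119, and the ordering survives the conversion.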
>
>Now, at the top of the search results is "Insulator Show Reports" and a
>little further down "Insulator Show Calendar", as I would expect (and hope).
>
>I did try several other searches for single and multiple words, and the
>results are significantly better!
>
>So, swish DID have the internal information to rank this "and" query of
>two words that appeared often. It just lost all that information in the
>conversion of a double to an integer...
>
>There IS hope! :-)
>
>Also, the sense of "IgnoreTotalWordCountWhenRanking" is backwards as you
>can see from the following piece of code:
>
>    if (ignoreTotalWordCountWhenRanking)
>    {
>        e /= words;
>    }
>    else
>    {
>        e /= 100;
>    }
>
>If it is true, the rank is divided by the number of words, which is
>exactly what the name says it should ignore! It is off by default, but
>several people suggested I turn it on when I was having problems with the
>ranking. Little did we know at the time that this only made it worse...
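
One way the corrected sense of the flag might look, assuming the fix is simply to invert the test. That assumption is not confirmed anywhere in this thread. The function normalize_rank and its parameter names are hypothetical, and the guard against a zero word count is an addition of this sketch; only the divisor 100 and the division by words come from the snippet quoted above.

```c
/* Hypothetical sketch, not the actual index.c code.
 * ignore_total_word_count stands in for
 * IgnoreTotalWordCountWhenRanking. */
double normalize_rank(double e, int words, int ignore_total_word_count)
{
    if (!ignore_total_word_count)
    {
        /* Flag off: take the document's word count into account.
         * The zero check is an assumption added for safety. */
        if (words > 0)
        {
            e /= words;
        }
    }
    else
    {
        /* Flag on: use the fixed divisor from the quoted code. */
        e /= 100;
    }
    return e;
}
```

With this version, turning the flag on gives every file the same divisor, so a raw score of 50.0 becomes 0.5 regardless of length, while with the flag off a 200-word file gets 0.25.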
>
>I'll have some more information when I do some more work in this area
>and when I am better able to get involved in this project.
>
>Bill
Received on Mon Apr 30 17:45:39 2001