Re: SWISH ranking vs. AltaVista

From: David Wood <dwood(at)not-real.inter.nl.net>
Date: Mon Apr 30 2001 - 17:40:42 GMT
Hi Bill,

Would you mind posting the actual code changes you made to index.c?  I'd 
like to give it a try as well.

Did you just change:

f = e * 10000.0;
to
f = e * 1000000.0;

and

if (ignoreTotalWordCountWhenRanking)
to
if (!ignoreTotalWordCountWhenRanking)

?

thanks,

David


At 00:21 30-04-01, Bill Meier wrote:
>At 07:42 PM 04/28/01, Bill Moseley wrote:
>>Two points.  You are searching your site, so I'd expect that searching for
>>"insulators" might turn up a lot of hits for "insulators". ...
>>
>>Second, phrase searches (which means word positions) are new to swish.  So
>>word proximity hasn't been taken into account (as far as I know) ... 
>>[correct, they aren't yet]
>>
>>All that being said, if you have the time, take a look at the code and come
>>up with some good algorithms.  Open source projects can always use some
>>good help, and there's no doubt at all that the ranking can be vastly
>>improved with swish.
>
>I looked at the ranking code, did some debugging, and found a few serious 
>problems...
>
>Basically, it is a simple loss of precision. In index.c, the function getrank 
>computes a rank as a double and then converts it to an integer after 
>scaling it by 10,000. If you are searching for words that occur often, say 
>on 100 or more pages (which is factored into the rank), the floating point 
>computation of rank yields very small numbers that, when multiplied by 
>10,000, yield integers in the single digit range. You just lost all your 
>precision...
>
>In my example of "Insulator Shows", insulator occurred in 887 files, and 
>the rank was around 1.0 - 1.5; shows occurred in 350 files, with a 
>rank from 2.0 - 3.0. So what you got back was little more than a list of 
>files that contained the words "insulator" and "shows" in some arbitrary order.
>
>This is not a complete fix, but I simply increased the multiplier by a 
>factor of 100, so now rather than turning 1.51634 into a rank of 1, it 
>turns it into 
>151... And, that ALONE made an incredible difference!
>
>Now, on the top of the search is "Insulator Show Reports" and a little 
>further down "Insulator Show Calendar", as I would expect (and hope).
>
>I did try several other searches for single and multiple words, and the 
>results are significantly better!
>
>So, swish DID have the internal information to rank this "and" query of 
>two words that appeared often. It just lost all that information in the 
>conversion of a double to an integer...
>
>There IS hope! :-)
>
>Also, the sense of "IgnoreTotalWordCountWhenRanking" is backwards as you 
>can see from the following piece of code:
>
>     if (ignoreTotalWordCountWhenRanking)
>     {
>         e /= words;
>     }
>     else
>     {
>         e /= 100;
>     }
>
>If it is true, it modifies the rank by dividing by the number of words! It 
>is off by default, but several people suggested I turn it on when I was 
>having problems with the ranking. Little did we know at the time, that 
>this only made it worse...
>
>I'll have some more information when I do some more work in this area, and 
>when I am more able to get involved in this project.
>
>Bill
Received on Mon Apr 30 17:45:39 2001