Re: SWISH ranking vs. AltaVista

From: Bill Meier <bill(at)not-real.insulators.com>
Date: Sun Apr 29 2001 - 22:21:48 GMT
At 07:42 PM 04/28/01, Bill Moseley wrote:
>Two points.  You are searching your site, so I'd expect that searching for
>"insulators" might turn up a lot of hits for "insulators". ...
>
>Second, phrase searches (which means word positions) are new to swish.  So
>word proximity hasn't been taken into account (as far as I know) ... 
>[correct, they aren't yet]
>
>All that being said, if you have the time, take a look at the code and come
>up with some good algorithms.  Open source projects can always use some
>good help, and there's no doubt at all that the ranking can be vastly
>improved with swish.

I looked at the ranking code, did some debugging, and found a few serious 
problems...

Basically, it is a simple loss of precision. In index.c, the function
getrank computes a rank as a double, scales it by 10,000, and then
converts it to an integer. If you are searching for words that occur
often, say on 100 or more pages (the page count is factored into the
rank), the floating point computation yields very small numbers that,
when multiplied by 10,000, give integers in the single digit range.
You just lost all your precision...

In my example of "Insulator Shows", insulator occurred in 887 files, and
the rank was around 1.0 - 1.5; shows occurred in 350 files, with a rank
from 2.0 - 3.0. So, what you got back was little more than a list of files
that contained the words "insulator" and "shows", in some arbitrary order.

This is not a complete fix, but I simply increased the multiplier by a
factor of 100, so now rather than turning 1.51634 into a rank of 1, it
turns it into 151... And, that ALONE made an incredible difference!
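
To make the truncation concrete, here is a minimal sketch of just the
scale-and-convert step. The helper name scale_rank and the sample values
are made up for illustration; this is not the actual getrank code, only
the arithmetic described above.

```c
/* Hypothetical stand-in for the last step of getrank(): scale the
 * floating-point rank, then convert it to an integer. The cast
 * truncates toward zero, so everything after the decimal point
 * is discarded. */
static int scale_rank(double rank, double multiplier)
{
    return (int)(rank * multiplier);
}

/* Illustrative values only (not taken from a real index):
 *
 *   scale_rank(0.000151634, 10000.0)    gives 1
 *   scale_rank(0.000151634, 1000000.0)  gives 151
 *
 * Two files whose raw ranks differ by almost 50% collapse to the
 * same integer with the smaller multiplier:
 *
 *   scale_rank(0.0001012, 10000.0) == scale_rank(0.0001497, 10000.0)
 *
 * but stay apart (101 vs. 149) with the larger one. */
```

With the smaller multiplier, every file in that range ends up with the
same rank, so the sort order among them is effectively arbitrary.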

Now, at the top of the search results is "Insulator Show Reports" and a
little further down "Insulator Show Calendar", as I would expect (and hope).

I did try several other searches for single and multiple words, and the 
results are significantly better!

So, swish DID have the internal information to rank this "and" query of two 
words that appeared often. It just lost all that information in the 
conversion of a double to an integer...

There IS hope! :-)

Also, the sense of "IgnoreTotalWordCountWhenRanking" is backwards as you 
can see from the following piece of code:

     if (ignoreTotalWordCountWhenRanking)
     {
         e /= words;
     }
     else
     {
         e /= 100;
     }

If it is true, it modifies the rank by dividing by the number of words,
which is the opposite of what the name says! It is off by default, but
several people suggested I turn it on when I was having problems with the
ranking. Little did we know at the time that this only made it worse...
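
One reading of "backwards" is that the two branches should simply be
swapped. A rough sketch of that reading follows; the helper name
normalize_rank and the guard against a zero word count are my own
additions for illustration, not something taken from index.c.

```c
/* Hypothetical helper: the same two divisors as the quoted branch,
 * with the arms swapped so the flag does what its name suggests. */
static double normalize_rank(double e, int words,
                             int ignoreTotalWordCountWhenRanking)
{
    if (ignoreTotalWordCountWhenRanking || words <= 0)
    {
        /* Fixed divisor: the file's word count plays no part.
         * (The words <= 0 guard is an assumption, added only to
         * avoid dividing by zero in this sketch.) */
        e /= 100;
    }
    else
    {
        /* Word count is taken into account. */
        e /= words;
    }
    return e;
}
```

For example, with e = 200.0 and words = 50, this returns 2.0 when the
flag is set and 4.0 when it is not.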

I'll have some more information when I do some more work in this area, and
when I am more able to get involved in this project.

Bill
Received on Sun Apr 29 22:22:46 2001