At 07:42 PM 04/28/01, Bill Moseley wrote:
>Two points. You are searching your site, so I'd expect that searching for
>"insulators" might turn up a lot of hits for "insulators". ...
>
>Second, phrase searches (which means word positions) are new to swish. So
>word proximity hasn't been taken into account (as far as I know) ...
>[correct, they aren't yet]
>
>All that being said, if you have the time, take a look at the code and come
>up with some good algorithms. Open source projects can always use some
>good help, and there's not doubt at all that the ranking can't be vastly
>improved with swish.
I looked at the ranking code, did some debugging, and found a few serious
problems...
Basically, it is a simple loss of precision. In index.c, the function getrank
computes the rank as a double and then converts it to an integer after
scaling it by 10,000. It happens that if you are searching for words
that occur often, say on 100 or more pages (the page count is factored into
the rank), the floating-point computation yields very small numbers
that, when multiplied by 10,000, give integers in the single-digit range.
You just lost all your precision...
In my example of "Insulator Shows", insulator occurred in 887 files, and
the rank was around 1.0 - 1.5; and shows occurred in 350 files, with a rank
from 2.0 - 3.0. So what you got back was little more than a list of files
that contained the words "insulator" and "shows", in some arbitrary order.
This is not a complete fix, but I simply increased the multiplier by a factor
of 100 (from 10,000 to 1,000,000), so now rather than turning 1.51634 into a
rank of 1, it turns it into 151... And that ALONE made an incredible difference!
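
To illustrate the truncation, here is a minimal standalone sketch. It is not
the actual getrank code: scale_rank is a hypothetical helper, and the raw rank
values are made up so that they land near the 1.51634 figure above.

```c
/* Hypothetical helper, NOT the real getrank(): scale a raw floating-point
   rank by a multiplier and truncate the result to an integer, which is
   the step where the precision is lost. */
static int scale_rank(double raw, double multiplier)
{
    return (int)(raw * multiplier);
}

/* Illustrative raw ranks for two different files (assumed values). */
static const double RAW_A = 0.000151634;   /* 1.51634 after x10,000 */
static const double RAW_B = 0.000112345;   /* 1.12345 after x10,000 */
```

With a multiplier of 10,000, scale_rank gives 1 for both RAW_A and RAW_B, so
the two files become indistinguishable. With 1,000,000 it gives 151 and 112,
and the ordering survives the conversion.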
Now, on the top of the search is "Insulator Show Reports" and a little
further down "Insulator Show Calendar", as I would expect (and hope).
I did try several other searches for single and multiple words, and the
results are significantly better!
So, swish DID have the internal information to rank this "and" query of two
words that appeared often. It just lost all that information in the
conversion of a double to an integer...
There IS hope! :-)
Also, the sense of "IgnoreTotalWordCountWhenRanking" is backwards as you
can see from the following piece of code:
    if (ignoreTotalWordCountWhenRanking)
    {
        e /= words;
    }
    else
    {
        e /= 100;
    }
If it is true, the rank is divided by the number of words, which is the
opposite of ignoring the word count! It is off by default, but several people
suggested I turn it on when I was having problems with the ranking. Little
did we know at the time that this only made it worse...
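
If the flag were meant to do what its name says, one would expect the two
branches to be swapped. A minimal sketch of that, assuming e is the raw rank
and words is the file's total word count: only e, words and
ignoreTotalWordCountWhenRanking come from the snippet above, and the
normalize_rank wrapper is hypothetical.

```c
/* Hypothetical wrapper around the snippet above, with the branches swapped
   so that a true flag really does ignore the word count. */
static double normalize_rank(double e, int words,
                             int ignoreTotalWordCountWhenRanking)
{
    if (ignoreTotalWordCountWhenRanking)
    {
        e /= 100;      /* fixed divisor: the word count plays no part */
    }
    else
    {
        e /= words;    /* normalize by the file's total word count */
    }
    return e;
}
```

For a raw value of 50.0 in a 200-word file, this returns 0.5 with the flag on
and 0.25 with it off. A real fix would also need to consider what happens
when words is 0.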
I'll have some more information when I do some more work in this area, and
when I am more able to get involved in this project.
Bill
Received on Sun Apr 29 22:22:46 2001