Re: Inaccurate ranking?

From: <sam1600(at)not-real.iname.com>
Date: Thu Feb 25 1999 - 15:34:40 GMT
Mark,

Thanks a lot for getting back to me!

Sorry for taking so long to respond but I've
been obsessed with the swish ranking function ;-)

I had "IgnoreTotalWordCountWhenRanking = yes"

...when I sent my last email.  Your additional
code does make a Huge improvement over not using
it!

But as you know, my results were still unfavorable.

So I decided to dive into the code myself and
see if I could improve the rank function.

I searched the web for a rank function and found
a couple of resources.

It appears that Dr. Dik L. Lee:
http://www.cs.ust.hk/faculty/dlee/bio.html

... had a large part in "Document Ranking and the
Vector-Space Model" (see his page above for
downloadable publications on the topic).

There is also another page authored in part by
Dr. Lee on the topic of ranking which I found
at W3C:
http://www.w3.org/Conferences/WWW4/Papers/66/

I got lucky when I found that page because the
image of the scientific notation for the
"ranking algorithm" has the algebraic equation
as its <Alt> text. (My math skills are no longer
up to par ;-)

Here is Dr. Lee's equation and some explanation of the
variables ( as taken from the www.w3.org page ):

-----------
R(i,Q) = Sum (for all term(j) in Q) (0.5 + 0.5 IDF(j) TF(i,j)/TF(i,max))

where
TF(i,j) is the term frequency of term(j) in document(i), and

TF(i,max) is:
the maximum term frequency of a keyword in document(i) and

IDF(j) is the inverse document frequency of term(j),
which is defined as in Equation 2 below:
IDF(j) = log(N/DF(j))

where N is the number of documents in the collection, and
DF(j) is the number of documents containing term(j)

-------------

I think the above equation is a bit different from the one
already used in Swish. The current getrank
function does not include the "total number of files"
or the "total number of files containing the query word".

So here is what I added/changed to the Swish index.c and
index.h files...:

<BEGIN NEW GETRANK FUNCTION>:
int getrank(totlfiles, nmfileswithword, freq, tfreq, words, emphasized)

/* totlfiles=total number of files */
int totlfiles;
/* nmfileswithword=sum of only files containing query word*/
int nmfileswithword;
/* freq=sum of queryword in this ONE file containing queryword */
int freq;
/* tfreq=sum of queryword in ALL files containing queryword */
int tfreq;

int words;
int emphasized;
{

double inversefreq, f;
 int tmprank;

/*
**my rendering of the function found on
**http://www.w3.org/Conferences/WWW4/Papers/66/
*/
inversefreq = log(totlfiles/nmfileswithword);
f = ((tfreq) * (0.5 + (0.5 * (freq/50)))) * inversefreq;

tmprank = (int) f;
 if (tmprank <= 0)
  tmprank = 1;
 if (emphasized)
  tmprank *= emphasized;
 if (!(tmprank % 128))
  tmprank++;

 return tmprank;
}
<END NEW GETRANK FUNCTION>

So I also added/changed the following to the printindex function:

<BEGIN PRINTINDEX FUNCTION changes/additions>:

 int numfileswithword, myfilep, thetotalfiles;

/*added this loop to count only the number of files containing the queryword */

numfileswithword = 0;
while (lp != NULL) {
 numfileswithword++;
 lp = lp->next;
 }

/* Here I set lp back to what it was before my loop */

lp = ep->locationlist;

/* Here I try to get the total number of files/pages
** in the whole document structure...
** I really don't know if my variable "thetotalfiles"
** is being set because I don't know if the variable
** "filelist" is set here inside the printindex function.
** I don't do anything with filelist other than include
** it here:
*/

   myfilep = filelist;
   thetotalfiles = getfilecount(myfilep);

/*Here I added the new parameters to the getrank function call */

rank = getrank(thetotalfiles, numfileswithword, lp->frequency,
ep->tfrequency, totalWords, lp->emphasized);

<END PRINTINDEX FUNCTION changes/additions>
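
The counting step above can also be written so that the list pointer never
has to be reset afterwards. The following is only a sketch: struct location
is a hypothetical stand-in for whatever lp points to in Swish, and
count_locations is a made-up helper name.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Swish location entry; only the link matters here. */
struct location {
    int frequency;
    struct location *next;
};

/* Count the nodes in a list without modifying the caller's pointer. */
int count_locations(const struct location *head)
{
    const struct location *cur;
    int n = 0;

    /* Walk a separate cursor so the caller's head pointer stays untouched. */
    for (cur = head; cur != NULL; cur = cur->next)
        n++;
    return n;
}
```

With a helper like this, the value for numfileswithword could be obtained in
one call and lp would still point at the first entry.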

And finally I added a couple of variable declarations to the
index.h file getrank declaration:

int getrank _AP ((int, int, int, int, int, int));

So, as you can see, my C skills are lacking :-0

I might have made some stupid errors but it did compile.
The ranking has NOT improved!  Maybe the index file
is totally screwed up ( it does look a lot different than
the old one)

As you can see I have left out a bit of stuff from the rank
function... including "IgnoreTotalWordCountWhenRanking".
I thought I'd test it raw.

Mark, can you (or anyone else reading this) see my
mistakes?  Or suggest improvements?

Also, Dr. Lee has a WEALTH of information on
ranking algorithms on his page, available for download
(in PostScript format).
Especially the following:

1) "Document Ranking and the Vector-Space Model"
2) "Search and Ranking Algorithms for Locating Resources
on the World Wide Web"
3) "Implementation of Partial Document Ranking Using
Inverted Files"

Dr Lee's ranking equations in those files mentioned
above are much more complicated than the one I used.
( well I can't figure them out anyway ;-)

If there are any math whizzes out there, maybe you could take
a look and convert the equations into "simple" equations for
me.

Dr. Lee's files are in PostScript so you need a viewer to
view them.  Download Ghostscript at:
http://www.cs.wisc.edu/~ghost/cd.html

I look forward to hearing from you all.

Thanks,

Sam


 ---- you wrote: 
> Hi
> The ranking is "complex". It uses the total number of words in
> a file to spread out the "weight" of any given word more "evenly."
> This behavior did not work for me so I added a new directive
> called "IgnoreTotalWordCountWhenRanking". You should see
> this (commented out) in your config file. Uncomment it and
> set it to "yes", then reindex.  This will cause the rank to be more 
> in line with word count. Try this and see if it helps.
> 	Mark
> 
> 
> At 03:39 PM 2/20/99 -0800, sam1600@iname.com wrote:
> >Hello,
> >
> >Sorry for posting what may be a blatantly newbie
> >comment/question ;-) but for some reason a search
> >on a particular keyword always returns an inaccurate
> >ranking.  This keyword "gmc" occurs at least twice
> >on every page ( once in a metatag and once in a link )
> >but more than ten times on a particular page.
> >( I have a search box on every page, and why
> >people search for this word when there is a link is
> >beyond me but they do anyway ;-)
> >
> >It is a small site with only a few dozen pages
> >and if I search for this keyword that I know
> >for sure occurs on a certain page more
> >times than other pages, the said page is
> >ranked far down the list.
> >
> >The command line is simple with just
> >the -f -w and -m options specified.
> >
> >I've read in the mailing list that the rank
> >algorithm takes a few things into account
> >when ranking but I just can't see why it
> >would override the total number of
> >occurrences of a keyword as the most important
> >criteria.
> >
> >I've been using Swish ( not Swish-e ) and have
> >been logging the visitors search keywords
> >( and this keyword is a popular one... hence
> >the reason for me testing it ).  I'm not obsessed
> >with this keyword ;-) I'm just curious how often the
> >same inaccurate ranking is occurring on other words also.
> >
> >Oh, by the way.  This bad ranking is NOT occurring
> >with the old Swish.
> >
> >Comments anyone?
> >
> >Thanks,
> >
> >Sam
> >
> >
> >
> >----------------------------------------------------------------
> >Get your free email from AltaVista at http://altavista.iname.com
> > 
> 


Received on Thu Feb 25 07:33:35 1999