ranking ideas

From: Peter Karman <karman(at)not-real.cray.com>
Date: Fri Apr 16 2004 - 21:39:33 GMT
I've been thinking about future features for ranking. Here are some 
ideas I've had. I'm interested in any math whizzes out there who can 
point out where my thinking is flawed (anyone else can point out my 
flawed thinking too; I'm an easy target...).

1. WordPositionBias *integer* *bias*

If the word position of a match is within the first *integer* words of 
the document, weight the relevance of that doc by *bias*. I imagine this 
would affect ranking much as MetaNamesRank does, but I'd 
have to test and see. (I'm basing this assumption on the MetaBias 
comments in rank.c.)

Example:

A search for 'foo AND bar' yields 100 hits. If:

WordPositionBias 50 +2

then for every occurrence of 'foo' in the first 50 words of the doc, 
weight the doc's rank +2. Do the same for 'bar'.
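
A rough sketch of that arithmetic in Perl, under some assumptions: word_position_bias() is a hypothetical helper (nothing like it exists in swish-e), word positions are taken to be 1-based, and the positions are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: add $bias to a doc's rank for every match position
# that falls within the first $limit words of the document.
# $positions is an array ref of (assumed 1-based) word positions
# for one query word in one document.
sub word_position_bias {
    my ($positions, $limit, $bias) = @_;
    # grep in scalar context returns the number of matching elements
    my $early = grep { $_ <= $limit } @$positions;
    return $early * $bias;
}

# WordPositionBias 50 +2, with made-up positions for a single document
my %doc_positions = (
    foo => [ 3, 47, 212 ],   # two of the three fall in the first 50 words
    bar => [ 88 ],           # none fall in the first 50 words
);

my $bump = 0;
$bump += word_position_bias($doc_positions{$_}, 50, 2)
    for sort keys %doc_positions;

print "rank bump: $bump\n";   # prints "rank bump: 4"
```

Here 'foo' earns +2 twice and 'bar' earns nothing, so the doc's rank goes up by 4.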

2. RelativeFrequencyBias *percent* *bias* *max*

I know, this seems like newer new math (and my math was never 
outstanding; I'm a literary critic by training...). But consider this 
example and please tell me where my logic is wrong:

swish-e -w 'foo AND bar'
returns 100 hits out of a total 1000 docs. If you knew that 'foo' 
appeared in all 100 docs, but 'bar' appeared in only 8, then shouldn't 
docs with 'bar' in them weigh more (be more relevant) than docs with 
'foo'? How do we calculate that bias? How about:

RelativeFrequencyBias 8 +2 10

Because the frequency of 'bar' is <= the percent value (8), rank gets 
incremented by +2 for each occurrence of 'bar' in each doc, up to a 
maximum of 10 occurrences (to avoid weighting large docs unfairly).

Each doc with 'bar' in it would be up to 20 rank points higher, because 
'bar' occurs in such a small relative percentage of the total hits.

The default setting would be:

RelativeFrequencyBias 0 0 0

so it wouldn't affect anything by default, but users could play with the 
numbers to see if it gave the desired effect.
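
One possible reading of that rule, sketched in Perl. relative_frequency_bias() is a hypothetical helper, and it assumes that "frequency" means the percentage of hits containing the word. None of this is existing swish-e code.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper for "RelativeFrequencyBias percent bias max".
#   $docs_with_word : number of hits that contain the word
#   $total_hits     : number of hits for the whole query
#   $occurrences    : how many times the word occurs in this document
sub relative_frequency_bias {
    my ($docs_with_word, $total_hits, $occurrences,
        $percent, $bias, $max) = @_;
    return 0 unless $total_hits > 0;
    my $freq = 100 * $docs_with_word / $total_hits;
    return 0 if $freq > $percent;            # word is too common to earn a bias
    return $bias * min($occurrences, $max);  # cap so large docs are not favored
}

# RelativeFrequencyBias 8 +2 10, with 'bar' in 8 of 100 hits
printf "3 occurrences:  +%d\n", relative_frequency_bias(8, 100, 3, 8, 2, 10);
printf "25 occurrences: +%d\n", relative_frequency_bias(8, 100, 25, 8, 2, 10);

# 'foo' is in all 100 hits, so it is over the threshold and earns nothing
printf "foo:            +%d\n", relative_frequency_bias(100, 100, 5, 8, 2, 10);
```

With these numbers a doc with 3 occurrences of 'bar' gains 6 points, one with 25 occurrences is capped at 20, and the default of 0 0 0 always yields 0.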


3. WordProximityBias *integer* *bias*

If 'foo' is within *integer* word positions from 'bar', bias that hit by 
*bias* times the inverse of the distance.

Example:

query: 'foo AND bar'
hits: 100
WordProximityBias 10 +2

In 5 of the docs, foo is 10 or fewer word positions away from bar, so 
increment the rank of those docs by 2 * ( 1 / $distance). Do that 
increment for every instance of foo that is <= 10 positions from bar.

or in perl-ese (NOTE I haven't actually run this code; this is just a
scribble):

# @hits and get_pos() are placeholders; neither is defined here
my @words = qw(foo bar);
my $int = 10;
my $bias = 2;
my %docbias;
DOC: for my $doc (@hits) {
	my %positions;
	WORD: for my $word (@words) {
		# get_pos() returns array of word positions
		$positions{$_} = $word for ( get_pos($word,$doc) );
	}
	# get relationship between each position
	# and every later position, so each pair counts once
	my $rank_bump = 0;
	POSITION1: for my $p1 (keys %positions) {
		my $w = $positions{$p1};
		POSITION2: for my $p2 (keys %positions) {
			# skip pairs of the same word
			next POSITION2 if $positions{$p2} eq $w;
			# skip pairs already seen in the other order
			next POSITION2 if $p2 <= $p1;
			my $dist = $p2 - $p1;	# always positive here
			if ($dist <= $int) {
				$rank_bump += $bias * (1 / $dist);
			}
		}
	}
	$docbias{$doc} = $rank_bump;
}
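
To sanity-check the arithmetic, here is a small self-contained variant that can actually be run. proximity_bias() is a hypothetical helper, and the word positions are made up. It takes the position lists directly instead of calling a lookup function.

```perl
use strict;
use warnings;

# Hypothetical helper: sum $bias / distance over every pair of positions
# (one from each word) that lie at most $int word positions apart.
sub proximity_bias {
    my ($pos_first, $pos_second, $int, $bias) = @_;
    my $bump = 0;
    for my $pa (@$pos_first) {
        for my $pb (@$pos_second) {
            my $dist = abs($pa - $pb);
            # skip pairs that are too far apart (and guard against dist 0)
            next if $dist == 0 || $dist > $int;
            $bump += $bias / $dist;
        }
    }
    return $bump;
}

# WordProximityBias 10 +2
# foo at positions 3 and 20, bar at positions 5 and 28 (made-up data)
my $bump = proximity_bias([3, 20], [5, 28], 10, 2);
print "rank bump: $bump\n";   # prints "rank bump: 1.25"
```

Only two pairs are within 10 positions: (3,5) at distance 2 contributes 2/2 = 1, and (20,28) at distance 8 contributes 2/8 = 0.25.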


-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Fri Apr 16 14:39:35 2004