Re: ranking change

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Aug 04 2004 - 19:14:58 GMT
Bill Moseley wrote on 8/4/04 1:03 PM:

> On Wed, Aug 04, 2004 at 04:08:09AM -0700, Peter Karman wrote:
> 
>>For example, if the word 'the' appears in 98% of the docs in your index, 
>>it will have an IDF of 1. If the word 'foo' appears in 10% of your docs, 
>>it will have an IDF of something greater than 1 (something like 5 or 6, 
>>depending on the math, number of docs, etc.). So for a query of 'the 
>>foo', docs with more instances of 'foo' will rank relatively higher than 
>>docs with fewer instances of 'foo', while instances of 'the' will affect 
>>ranking much the same way they do now (that is to say, not much).
> 
> 
> Does this affect this config option?
> 
>    IgnoreTotalWordCountWhenRanking
> 

Not for this IDF feature. Ideally, total word count would be used to 
calculate word density and to normalize a document for length (really, 
number of words). For that to work, IgnoreTotalWordCountWhenRanking 
would need to be set to 0, or the word count would need to be stored in 
the index regardless and simply ignored for ranking. (At present it 
seems not to be stored in the index at all if 
IgnoreTotalWordCountWhenRanking is set to 1.) I've been fooling with 
that but haven't had time to really test the results.
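
To make "word density" concrete, here is a minimal sketch in Python. It 
is not swish code; the function name and the plain count/length ratio 
are assumptions made only for illustration.

```python
def word_density(term_count, total_words):
    """Fraction of a document's words that are the query term.

    Hypothetical helper: dividing by total_words is one simple way to
    normalize for document length, so a long document does not outrank
    a short one merely by repeating the term more often.
    """
    if total_words <= 0:
        return 0.0
    return term_count / total_words


# Five hits in a 100-word doc are denser than five hits in 1000 words.
short_doc = word_density(5, 100)
long_doc = word_density(5, 1000)
```

This only works if the total word count is available at search time, 
which is why the setting above matters.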



> 
>>IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother 
>>scale. A word isn't just in or out (a StopWord or not), but rather has a 
>>relative weight compared to all the other words in the index.
> 
> 
> But, that's not implemented, right?  So is the idea that stopwords
> just have a much lower score?

Sorry, I don't follow. What's not implemented? The 'smooth' effect 
should be felt with the new IDF feature.

But yes, you're right about the idea: stopwords are still counted but 
just have a much lower score. That allows you to still find exact 
phrases like "the foo" as opposed to "a foo" but rank/weight is adjusted 
per word.
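
A rough sketch of that per-word weighting, in Python. The formula 
(1 + ln(N/df)) is a common textbook form of IDF and is assumed here 
purely for illustration; it is not necessarily the math swish uses, 
and the function names are made up.

```python
import math


def idf(total_docs, docs_with_word):
    """Inverse document frequency, assumed form: 1 + ln(N / df)."""
    if docs_with_word <= 0:
        raise ValueError("word must appear in at least one document")
    return 1.0 + math.log(total_docs / docs_with_word)


def score(term_counts, query_words, total_docs, doc_freq):
    """Sum of (occurrences in doc * IDF) over the query words."""
    return sum(
        term_counts.get(word, 0) * idf(total_docs, doc_freq[word])
        for word in query_words
        if word in doc_freq
    )


# 'the' is in 98 of 100 docs, 'foo' in 10 of 100.
doc_freq = {"the": 98, "foo": 10}
few_foo = score({"the": 10, "foo": 1}, ["the", "foo"], 100, doc_freq)
many_foo = score({"the": 10, "foo": 4}, ["the", "foo"], 100, doc_freq)
```

With this particular formula 'the' gets a weight just above 1 and 'foo' 
a weight a little over 3, so extra instances of 'foo' move the score 
far more than extra instances of 'the'. Both words are still counted, 
which is what keeps phrase searches working.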


> 
> 
> 
>>I have several other new ranking features in the works, but wanted to 
>>get some feedback for this one before I move ahead too much in this 
>>direction. Other features might include:
>>
>>	normalizing weight for word density/document length
>>	scaling the IDF to allow for greater granularity in difference
>>	weighting words based on their proximity to other query words
> 
> 
> That last one would be nice -- if that worked well then the default
> search might be "OR", but the "ANDed" results get ranked much higher.
> 

Yes. Though I think that a true vector ranking scheme would have to be 
used instead of the current AND/OR system. I don't mean you couldn't 
have both (vector ranking and boolean AND/OR) but as I understand it, 
vector ranking uses document *similarity* to a query. It's much fuzzier 
than the strict AND/OR boolean system swish currently uses. I guess it's 
more like OR, with the effect you describe: ANDed results rank higher.

Vector ranking is claimed to handle natural-language queries better 
than boolean matching does.

More plainly, a vector-rank search would look like:

find all docs that contain any of the query words (i.e., an OR search)
within that subset, calculate a vector for each doc
calculate a vector for the query words
compare the query vector with each doc vector and return only those docs 
similar enough to merit inclusion (similarity >= threshold).
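
The steps above can be sketched in Python roughly as follows. This is 
only an illustration, not anything in swish: it assumes raw term counts 
as the vectors, cosine similarity as the comparison, and "similar 
enough" meaning similarity at or above the threshold, so that a 
threshold of 0 keeps every OR match. All names are hypothetical.

```python
import math
from collections import Counter


def cosine(vec_a, vec_b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a if term in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(docs, query, threshold=0.0):
    """Return (name, similarity) pairs, best match first.

    docs maps a document name to its text; whitespace splitting stands
    in for real tokenizing.
    """
    query_vec = Counter(query.lower().split())
    results = []
    for name, text in docs.items():
        doc_vec = Counter(text.lower().split())
        # Step 1: OR search, keep docs containing any query word.
        if not any(word in doc_vec for word in query_vec):
            continue
        # Steps 2-4: compare doc vector with query vector, apply cutoff.
        similarity = cosine(query_vec, doc_vec)
        if similarity >= threshold:
            results.append((name, similarity))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


docs = {"a": "the foo bar", "b": "the baz", "c": "qux"}
all_or_matches = vector_search(docs, "the foo")
strict_matches = vector_search(docs, "the foo", threshold=0.6)
```

Here doc "a" contains both query words and ranks above "b", which has 
only one; "c" never makes it past the OR step. Raising the threshold 
drops "b" as well.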

So I imagine config settings like:

UseVectorRanking 0|1
VectorThreshold *integer*

where a threshold of 0 returns everything that matches the OR search.
That would be most useful for purely HTML text searches. Anyone using 
swish to index XML and/or database output would likely want:

UseVectorRanking 0

and instead rely on the strict boolean AND/OR swish currently uses. 
UseVectorRanking should probably default to 0.

I need to get some of the other stuff working better before I tackle the 
whole vector deal, though. That's months off.

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Aug 4 12:15:16 2004