Re: ranking change

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 04 2004 - 19:22:42 GMT
On Wed, Aug 04, 2004 at 02:13:05PM -0500, Peter Karman wrote:
> >   IgnoreTotalWordCountWhenRanking
> >
> 
> Not for this IDF feature. Ideally, total word count would be used to 
> calculate word density and to normalize a document for length (really, 
> number of words). So IgnoreTotalWordCountWhenRanking would need to be 
> set to 0 for that to work (or word count stored in the index no matter 
> what but ignored for ranking -- it seems not to be stored in the index 
> at all if IgnoreTotalWordCountWhenRanking is set to 1). I've been 
> fooling with that but haven't had time to really test the results.

Right.  I have not looked at it in quite a while, but when word counts
are enabled (IgnoreTotalWordCountWhenRanking set to 0), an extra table
is created in the index and must be read while searching.
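
For anyone following along, here is a rough sketch of what "word density"
normalization could look like. This is only an illustration in Python, not
the swish-e ranking code; the function name and the numbers are made up.

```python
def word_density(term_count, total_words):
    """Fraction of a document's words that are the query term.

    Hypothetical helper: total_words is the per-document word count that
    would have to be stored in the index for this to work.
    """
    if total_words <= 0:
        return 0.0
    return term_count / total_words


# Same raw hit count, very different documents:
short_doc = word_density(term_count=5, total_words=100)
long_doc = word_density(term_count=5, total_words=10000)
```

With raw counts alone the two documents would tie; dividing by the total
word count lets the short, focused document rank higher.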


> >>IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother 
> >>scale. A word isn't just in or out (a StopWord or not), but rather has a 
> >>relative weight compared to all the other words in the index.
> >
> >
> >But, that's not implemented, right?  So is the idea that stopwords
> >just have a much lower score?
> 
> Sorry, I don't follow. What's not implemented? The 'smooth' effect 
> should be felt with the new IDF feature.

I mean the part about stopwords being implemented.  Currently they are
still just removed/ignored while indexing and searching.  What you are
talking about is something in the future where stopwords are not
removed but just weighted much lower.
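
To make the "smoother scale" concrete, here is a minimal sketch of the
textbook IDF weight, log(N / df). It is an assumption that the new feature
uses something like this form; the document frequencies below are invented
for illustration.

```python
import math


def idf_weights(doc_freq, total_docs):
    """Map each word to log(total_docs / number of docs containing it).

    Textbook inverse document frequency; not taken from the swish-e source.
    """
    return {
        word: math.log(total_docs / df)
        for word, df in doc_freq.items()
        if df > 0
    }


# Hypothetical index of 1000 documents.
doc_freq = {"the": 990, "search": 120, "swish": 3}
weights = idf_weights(doc_freq, total_docs=1000)
```

A word like "the" that appears in nearly every document ends up with a
weight close to zero without being dropped, while a rare word like "swish"
gets a large weight.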

> But yes, you're right about the idea: stopwords are still counted but 
> just have a much lower score. That allows you to still find exact 
> phrases like "the foo" as opposed to "a foo" but rank/weight is adjusted 
> per word.

I've thought about having a system where stopwords are not removed on
indexing, but ignored while searching unless the stopword is part of a
phrase.
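
Roughly what I have in mind on the search side, as a Python sketch. The
function name, the stopword list, and the double-quote phrase syntax are
all assumptions for illustration, not existing swish-e behavior.

```python
import re

# Hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "of"}


def plan_query(query, stopwords=STOPWORDS):
    """Split a query into terms, dropping stopwords outside phrases.

    Quoted phrases are returned as tuples with their stopwords intact;
    bare stopwords are ignored.
    """
    terms = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            # Inside a phrase every word matters, stopword or not.
            terms.append(tuple(phrase.lower().split()))
        elif word.lower() not in stopwords:
            terms.append(word.lower())
    return terms


terms = plan_query('the "the foo" bar')
```

Here the bare "the" is dropped, but the phrase keeps it, so "the foo" can
still be told apart from "a foo" as long as stopwords stay in the index.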

> More plainly, a vector-rank search would look like:
> 
> find all docs that contain any of the query words (i.e., an OR search)
> within that subset, calculate a vector for each doc
> calculate a vector for the query words
> compare the query vector with each doc vector and return only those docs 
> similar enough to merit inclusion (< threshold).

How's that vector computed?
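
For reference, one common way to do it (not necessarily what Peter has in
mind) is to weight each word by term frequency times IDF and compare
vectors by cosine similarity. The sketch below makes several assumptions:
the function names are invented, the IDF form is log(1 + N/df), and because
cosine similarity grows as documents get more alike, it keeps documents at
or above the threshold. A distance measure would flip that comparison.

```python
import math
from collections import Counter


def tfidf_vector(words, idf):
    """Sparse vector: word -> term frequency * IDF weight."""
    counts = Counter(words)
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query, docs, threshold):
    """Return (name, score) pairs sorted by similarity to the query."""
    query_words = query.lower().split()
    doc_words = {name: text.lower().split() for name, text in docs.items()}
    # Document frequency: in how many docs each word appears.
    df = Counter(w for words in doc_words.values() for w in set(words))
    idf = {w: math.log(1 + len(docs) / n) for w, n in df.items()}
    qvec = tfidf_vector(query_words, idf)
    results = []
    for name, words in doc_words.items():
        # OR search: skip docs that contain none of the query words.
        if not set(words) & set(query_words):
            continue
        score = cosine(qvec, tfidf_vector(words, idf))
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = {
    "a": "swish indexes documents",
    "b": "the cat sat",
    "c": "swish swish search engine",
}
hits = vector_search("swish search", docs, threshold=0.1)
```

In this toy index, "b" never enters the candidate set, and "c" ranks above
"a" because it matches both query words. Raising the threshold trims the
tail of weak matches.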



> So I imagine config settings like:
> 
> UseVectorRanking 0|1
> VectorThreshold *integer*

Or maybe a switch/option used at search time?

Thanks,


-- 
Bill Moseley
moseley@hank.org

Received on Wed Aug 4 12:23:04 2004