Re: Ranking, even with strong bias

From: Mark Maunder <mark(at)not-real.workzoo.com>
Date: Fri Feb 04 2005 - 16:15:41 GMT
Sounds like you're going to miss a lot of results that should have been
returned from hits in the 2nd paragraph onwards. I know this has been
mentioned before, but RankScheme(1) really solved the problem of big
chunks of text always winning for me. 

As an aside, at first I was freaked out by titles that contained
keywords being ranked lower than body text hits. But after actually
looking at the results, I discovered the documents were generally at
least as relevant as the title hits. 

On Fri, 2005-02-04 at 07:40 -0800, Tac Tacelosky wrote:
> Thanks for the many suggestions from this list.  The "hack" I got to work
> for my application was to only index the first "paragraph" (loosely
> defined), which shortened the description (and the relevant words were
> generally near the top).  Most descriptions were then the same length, which
> evened out the problem of the big ones always winning.
> 
> I like the title repetition idea, too; I may try that next round (though the
> bias adjustments should do that, and maybe they do but it's still not in the
> merged indexes).
> 
> Thanks again, everyone, for the ideas, it's been an interesting discussion!
> 
> Tac
> 
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu [mailto:swish-e@sunsite3.berkeley.edu]
> On Behalf Of Bill Moseley
> Sent: Friday, February 04, 2005 10:13 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Ranking, even with strong bias
> 
> 
> On Fri, Feb 04, 2005 at 03:23:38AM -0800, Thomas R. Bruce wrote:
> > Peter Karman wrote:
> > 
> > >indexing as html will artificially inflate the number of occurrences
> > >whenever a word matches in the <title>.
> > >
> > This does help, but not enough for some applications.  A real problem 
> > with relevance-ranked searches of collections of judicial opinions is 
> > that it's hard to force title weight high enough to overcome large 
> > numbers of term-occurrences in the body text
> 
> Yes, that's the problem with our relatively simple ranking system. Once I
> hacked rank.c to just not count word frequency over some reasonably small
> number, and that kept the huge docs from always winning.
> 
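A minimal sketch of the frequency cap described above, in Python rather than C. This is not the actual rank.c change; the function name, the cap of 5, and the plain sum-of-counts score are all assumptions made for illustration.

```python
from collections import Counter

def capped_score(doc_words, query_terms, cap=5):
    """Sum term frequencies, counting each query term at most `cap` times.

    `cap` is a hypothetical tuning knob, not a swish-e setting.
    """
    counts = Counter(doc_words)
    return sum(min(counts[term], cap) for term in query_terms)

# A short document with two hits and a long one with 200 hits.
short_doc = "foo is discussed briefly here foo".split()
long_doc = ("foo " * 200 + "and much other text").split()

print(capped_score(short_doc, ["foo"]))  # 2
print(capped_score(long_doc, ["foo"]))   # 5
```

With the cap in place the long document scores 5 instead of 200, so other factors (such as a title bias) have a chance to decide the order.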
> Sometimes it's not that helpful to search for a term "foo" and be told that,
> yes, it is in that 100 page document.  So another approach is to split your
> docs into smaller chunks and index them separately.  And if you can link
> into sections of your docs (like with URI #fragments)
> then your search results are even more targeted.  That can help with ranking
> a bit, but doesn't help much if you are searching for a common term.
> 
> Sounds like you need a better ranking system in general -- something that
> tries to figure out what a document is *about*.
> 
> > Anyway, our cheap kludge for dealing with this is to run a title-only 
> > search separately and prepend those results to the hit list for 
> > full-text search.  We tried jiggering the rankings as described in 
> > this thread and it helped, but not enough.
> 
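The title-first kludge above could look roughly like this. It is only a sketch: it assumes each search returns an ordered list of document identifiers, and dropping duplicates from the full-text list is my assumption, not something stated in the thread. The function name and the sample ids are hypothetical.

```python
def title_first(title_hits, fulltext_hits):
    """Put title-only hits first, then full-text hits not already listed.

    Each input keeps its own ranking order; a document is listed once.
    """
    seen = set()
    merged = []
    for doc_id in list(title_hits) + list(fulltext_hits):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

title_hits = ["case-12", "case-07"]
fulltext_hits = ["case-99", "case-12", "case-31"]
print(title_first(title_hits, fulltext_hits))
# ['case-12', 'case-07', 'case-99', 'case-31']
```

Note that any title hit outranks every body-only hit here, however common the word.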
> Does that mean if you have a word hit in the title then it will always be on
> top of results without a word hit in the title?  So a very common word in
> the title would still bring it to the top?
> 
> One thing I would suggest (not really related to above) is to use -T to dump
> your index (of maybe a small set of files) and look over the words swish is
> indexing.  You might want to filter your queries of common words for your
> corpus when they are not used in an explicit phrase search.
> 
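A rough sketch of the query filtering suggested above: drop corpus-common words from a query unless they sit inside a quoted phrase. The word list is a made-up example of what you might pick by hand after reading a -T dump, and the function is hypothetical, not part of swish-e. It also ignores operators such as and/or/not, which a real filter would need to leave alone.

```python
import re

# Hypothetical list of words that are too common in this corpus to be useful.
COMMON_WORDS = {"the", "of", "court"}

def filter_query(query, common=COMMON_WORDS):
    """Remove common words from a query, leaving "quoted phrases" untouched."""
    # Split into quoted phrases and bare words.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    kept = [t for t in tokens
            if t.startswith('"') or t.lower() not in common]
    # If everything was filtered out, fall back to the original query.
    return " ".join(kept) or query

print(filter_query('the court "the court of appeals" negligence'))
# "the court of appeals" negligence
```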
Received on Fri Feb 4 08:15:47 2005