Re: Title matches on result top

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Mar 09 2004 - 13:47:20 GMT
On Tue, Mar 09, 2004 at 12:38:17AM -0800, redna@euskalerria.org wrote:

> >You could try tweaking those, but the other problem is that swish
> >considers to some degree the number of hits in a file, so a large file
> >may out-rank a smaller file with the word in the title.
> 
> Does not swish-e convert frequencys into percents?? Would it be a bad idea?

You should look at rank.c.  That and the query processing are 
long-standing problems that need attention.  Ranking is very basic 
currently.

There's a mode to consider the length of the document in the rank
calculations but when I tested the feature it didn't seem to make much
difference in the ranking -- and in some cases made it worse.
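
As a rough illustration of what taking document length into account can mean, here is a minimal Python sketch that turns a raw hit count into a fraction of the document's word count. It is not swish-e's actual length mode from rank.c; the function name and the simple division are assumptions made for the example.

```python
def relative_frequency(hits, doc_words):
    """Hypothetical helper: hits as a fraction of the document's length.

    This is plain division, not the formula swish-e uses.
    """
    if doc_words <= 0:
        return 0.0
    return hits / doc_words


# A short page with 5 hits in 100 words versus a long one with 50 hits
# in 10,000 words.
short_page = relative_frequency(5, 100)
long_page = relative_frequency(50, 10000)
```

Note that plain division like this favours very short documents, so it trades one bias for another.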

It's subjective, of course.  What I did was index a few small (< 10,000 
pages) sites and then compare the search results with Google's.  I spent 
a day playing with small tweaks to rank.c and it was clear that very 
large files throw off the rank.  One true hack was to limit the number 
of word hits counted per document, and that one thing alone brought the 
results closer to how Google ranked them.  I just limited the frequency 
count to 100.  How's that for an ugly hack?
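
The frequency cap described above can be sketched roughly as follows. This is Python for illustration, not the C code in rank.c; the scoring formula, the `TITLE_BONUS` weight and the sample documents are all made up for the example. Only the cap of 100 comes from the message.

```python
TITLE_BONUS = 200  # made-up weight for a title match (assumption)
MAX_HITS = 100     # the frequency cap mentioned above


def score(body_hits, in_title, max_hits=None):
    """Toy score: (optionally capped) hit count plus a title bonus."""
    hits = body_hits if max_hits is None else min(body_hits, max_hits)
    return hits + (TITLE_BONUS if in_title else 0)


def ranked(docs, max_hits=None):
    """Return document names sorted by score, best first."""
    return sorted(
        docs,
        key=lambda name: score(*docs[name], max_hits=max_hits),
        reverse=True,
    )


# Hypothetical documents: name -> (body hits, word appears in title?)
docs = {
    "small.html": (5, True),
    "huge.html": (5000, False),
}

uncapped = ranked(docs)           # the huge file wins on raw hit count
capped = ranked(docs, MAX_HITS)   # the title match wins once hits are capped
```

With these invented numbers the very large file outranks the title match until its hit count is clamped, which is the effect the cap is meant to counter.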

I had also tried limiting the counts to the first X word positions, but 
that had less of an effect than I was expecting.  If you are looking 
for a document about something, you might think that it would be 
discussed early on in the document.
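
Counting only the hits in the first X word positions could look something like the sketch below. It assumes each hit is stored as a zero-based word position; the function name and the sample positions are hypothetical, and this is not taken from swish-e's source.

```python
def hits_in_first_positions(positions, limit):
    """Count hits whose word position is within the first `limit` words.

    `positions` holds zero-based word offsets (an assumption for this sketch).
    """
    return sum(1 for p in positions if p < limit)


# Hypothetical hit positions for one word in one document.
positions = [3, 40, 900, 12000]
early_hits = hits_in_first_positions(positions, 1000)
```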

Swish-e has been used for indexing reasonably small sets of documents,
so effective searching is often as helpful as the ranking.  Still, I
hope someone who knows something about ranking and has some time comes
along and updates swish-e's code.
 

-- 
Bill Moseley
moseley@hank.org
Received on Tue Mar 9 05:47:30 2004