
Re: Rank values. How are they generated?

From: <moseley(at)not-real.hank.org>
Date: Thu Sep 04 2003 - 15:16:54 GMT
On Thu, Sep 04, 2003 at 02:14:51AM -0700, William Bailey wrote:

> "For the FAQ we just need some general search score info rather than anything 
> specific."
> 
> 	Now apart from saying "The most relevant should have a higher score." I don't 
> exactly know what to say.
> 
> 	Now the data that is being searched is both large and has a lot of meta 
> fields defined so how will this affect the score? If required i can post 
> sample data as well as config files.

I suspect this is in the list archives some place.

No analysis is done to determine what the "keywords" in a document 
are -- it's just word frequency that sets the rank.

Look at rank.c.  The rank is a rather simplistic calculation, done for
each word in the query.  For HTML there's a point value for where a
word appears, i.e. in <title> or <b>, etc. (non-HTML docs all have the
same point value), and there is also a point value based on what meta
name is used (the rank bias, which defaults to zero).  The total rank is
just the sum of the point values.  The log() of this value is taken to
limit the effect of very large documents.

That seems to work reasonably well for smallish sets of documents that 
are similar in size (such as a collection of web pages).

I spent one day searching a somewhat larger collection (60,000 docs) and
comparing it with the same searches on Google.  Google's PageRank was 
clearly having a strong impact on search results, because documentation- 
and tutorial-type pages were often listed first.

Anyway, swish tended to return very large documents first just due to 
the number of search term hits.  Swish had indexed smaller web pages but 
also mailing list archives where a single mbox file could be a few 
megabytes.

I tried a number of changes to rank.c and didn't really see much change 
until I simply limited the word frequency to 100 (a totally arbitrary 
number); after that, those huge documents didn't affect results so much, 
and results were a lot closer to Google's.
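The cap itself is as simple as it sounds -- something along these lines, with 100 being the same arbitrary cutoff mentioned above:

```c
/* Cap the per-document hit count for a term so multi-megabyte
   documents (e.g. whole mbox archives) can't dominate the results
   purely on raw term frequency.  100 is arbitrary. */
#define FREQ_CAP 100

static int capped_frequency(int raw_hits)
{
    return raw_hits > FREQ_CAP ? FREQ_CAP : raw_hits;
}
```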

I didn't expect such a simplistic method to stay in the code, but 
testing with a few other collections of documents gave similar results.

[Jose, with that system we could instead limit word frequency on 
indexing and reduce index size, perhaps.]

I've often thought word position would also be a good thing to add into 
the calculation.  If you're trying to find a document *about* something 
(instead of a document that merely *contains* something), terms found 
early in the document should be worth more.
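One way that idea might look -- purely a sketch of the suggestion, not existing swish-e code, and the decay constants are invented:

```c
/* Hypothetical position weight: hits near the top of the document
   count up to 5x, decaying toward a floor of 1.0 as word position
   increases.  The 4.0 and 50.0 constants are arbitrary. */
static double position_weight(int word_pos)
{
    return 1.0 + 4.0 / (1.0 + word_pos / 50.0);
}
```

Each hit's point value would then be scaled by this factor before summing.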

AND and OR results just combine ranks.  AND does a running average, 
while OR, IIRC, sums them up, so that foo OR bar should rank docs 
containing both terms higher (but again, it depends on term frequency).
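In sketch form, assuming the per-term ranks are already computed (function names are mine, not swish-e's):

```c
/* OR: sum the term ranks, so a doc matching both terms outranks a
   doc matching only one. */
static double or_rank(double a, double b)
{
    return a + b;
}

/* AND: fold the next term's rank into a running average. */
static double and_rank(double running_avg, int nterms_so_far, double next)
{
    return (running_avg * nterms_so_far + next) / (nterms_so_far + 1);
}
```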

AND searches really should also adjust rank based on how close the words 
are to each other, as people often search for phrases without using quotes.

I've asked on this list and in other places for help with a rank 
redesign.  It would be a great project for a graduate student.

> 	I know the use is probably not what swish was designed for but it does the 
> job well, although the only feature I'm missing is searching for a range of 
> values. I know it can be done with -L but that only applies to properties 
> and therefore 1 value per file, which is not enough for my requirements as I 
> would like to order the results by a field that could occur more than once, 
> i.e. release dates. Anyway, before I get even more off topic :)

Searching for a range of values is really more a database function.  I 
don't really like the -L feature as it's not very scalable.  It just 
takes all the documents sorted by that property, inverts that table so 
swish-e can look up by file number, and then consults that table to 
filter out results.
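The mechanism described above boils down to something like this (a sketch of the idea, not the actual -L implementation):

```c
/* Given the file numbers in property-sorted order, build the inverse
   table: file number -> position in the sort.  A range test on the
   property then becomes a position-range test per result. */
static void invert(const int *sorted_files, int n, int *pos_by_file)
{
    for (int i = 0; i < n; i++)
        pos_by_file[sorted_files[i]] = i;
}

static int in_range(const int *pos_by_file, int file, int lo, int hi)
{
    int p = pos_by_file[file];
    return p >= lo && p <= hi;
}
```

The scalability complaint follows directly: the inverse table has to cover every document in the index, whatever the result set size.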

> 	For reference here is a typical query along with swish output...
> 
> User searches for:
> 	* artist: "Black Sabbath"
> 	* include compilation recordings in artist search.
> 	* track: iron man
> 	* format: CD
> 	* order: Search relevance (highest -> lowest)

Looks more like a database select than a full-text search.

> The following command get run:
> 
> /usr/local/bin/swish-e -H 9 -d\\t -w '(  (  recording.artist.main=( black 
> sabbath )  OR  recording.track.artist.main=( black sabbath )  OR  
> recording.artist.main.md5=(b1dd10efa6a2761536d12edc20edeca9)  OR  
> recording.track.artist.main.md5=(b1dd10efa6a2761536d12edc20edeca9)  )  AND  
> recording.track.title=(iron man)  AND  recording.media.available.group=( -cd- 
> )  AND  recording.available=( yes )  AND  recording.chanel=(musicmaster)  )'  
> -s swishrank desc recording.title asc recording.artist.main asc -b 0 -m 3000 
> -f /usr/home/wb/Web/Work/red-phase3/_server/data/swish/data.index

Each AND (including the default AND operator) and OR operation is a new 
search, so reducing the number of boolean operations would be good for speed.

Are you using the md5 keys for exact matches?  We have talked about 
setting flags on the first and last words indexed in a metaname so you 
could do a phrase search for "Black Sabbath" where "Black" was the first 
word indexed and "Sabbath" the last, i.e. the metaname is exactly "Black 
Sabbath".


-- 
Bill Moseley
moseley@hank.org
Received on Thu Sep 4 15:17:14 2003