Re: Ranking order based on file extension

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Thu Jul 20 2006 - 03:40:43 GMT
What you are describing is a situation I have run into before. It's not a Swish-e 
problem; it's a document management problem.

The solution is actually quite simple: only index one format of each document, 
then offer the user the opportunity to read it in alternate formats.

Since you don't manage the document collection, that solution can be harder to 
achieve: the collection may not be organized in a way that makes it easy to 
tell whether you are looking at multiple formats of the same document. Here's 
one solution someone found:

http://swish-e.org/archive/2005-02/8998.html
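If the alternate formats happen to share a path apart from the extension (an 
assumption; your repository may not be laid out that way), picking one format 
per document could be sketched roughly like this. The function name and the 
preference order are made up for illustration:

```python
import os
from collections import defaultdict

# Hypothetical preference order: earlier extensions win when a document
# exists in several formats.
PREFERENCE = [".html", ".htm", ".pdf", ".doc"]

def pick_one_format(paths, preference=PREFERENCE):
    """Map the path to index -> list of alternate-format paths.

    Assumes two files are the same document when their paths are
    identical apart from the extension.
    """
    groups = defaultdict(list)
    for p in paths:
        stem, _ext = os.path.splitext(p)
        groups[stem].append(p)

    def rank(p):
        ext = os.path.splitext(p)[1].lower()
        return preference.index(ext) if ext in preference else len(preference)

    chosen = {}
    for members in groups.values():
        ordered = sorted(members, key=rank)
        chosen[ordered[0]] = ordered[1:]   # index the first, link the rest
    return chosen
```

The keys are what you would hand to the indexer; the values are the alternate 
formats you can offer as links on the results page.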

As for a single html document being split across several files, I had the same 
situation. My solution was to create a virtual composite of those html files 
(order was irrelevant) and feed it to swish-e -S prog as a single document. That 
way the PDF vs HTML issue was moot.
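
A minimal sketch of the composite idea, not the script I actually used. It 
assumes the -S prog convention of a Path-Name header and a Content-Length 
header (in bytes), then a blank line, then the document body; the function 
name and the sample path are hypothetical:

```python
import sys

def composite_record(path_name, parts):
    """Build one prog-style record from several chunks of HTML.

    path_name: the path/URL the search result should point at
    parts:     list of bytes objects, one per chapter file
    """
    body = b"\n".join(parts)
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"   # length of the body in bytes
        "\n"                               # blank line ends the headers
    ).encode("utf-8")
    return header + body

if __name__ == "__main__":
    # Hypothetical usage: the chapter files are given on the command line,
    # and the composite is reported under a made-up landing page path.
    chunks = []
    for name in sys.argv[1:]:
        with open(name, "rb") as fh:
            chunks.append(fh.read())
    if chunks:
        sys.stdout.buffer.write(composite_record("manual/index.html", chunks))
```

You would point swish-e at a script like this with -S prog -i and let it print 
one record per logical document. Check the prog docs for the optional headers 
before relying on this.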

good luck,

pek



Shivakumar GN scribbled on 7/19/06 9:05 PM:
> On Sun, 2006-07-16 at 06:56 -0700, Bill Moseley wrote:
> 
>>> I am using swish to search a large repository of files that are in
>>> html,pdf & doc format and serve the search results to the web clients.
>>> I have a requirement to reduce the ranking of a file if it has pdf or
>>> doc extension.
>> Would have been fun to be in that meeting.  Everyone knows it's not
>> the content that's important but the container.  Much of the U.S.
>> consumer economy is based on that.
>>
> 
> It is not just a usability problem; there is a technical problem as well,
> leading to irrelevant search results. Point #2 below describes the
> problem of incorrect ranking.
> 
> 1. I have documentation that is duplicated in html and pdf. Not all of
> the documentation is duplicated, though, and it is difficult to remove
> the duplication since the repository is large and I am not the producer
> of it.
> 2. A PDF is a monolithic book, whereas the same content in html is
> distributed over many pages in the form of chapters. Because of this a
> single PDF has a higher frequency of occurrence of search words than
> any one html page. Thus a whole bunch of PDFs invariably appear at the
> top of the search results (even the not so relevant ones).
> 3. Even if a search is able to find both html and PDF, I would prefer
> html to come first since it is browseable, so the web-based experience
> is not lost.
> 
> 
>>> From swish documentation or the discussion archives, I couldn't find
>>> any details along these lines.
>>> Can the rank order be customized based on file extension, if so how
>>> can this be done.
>> You are right.  Swish doesn't give you a way to implement such a
>> thing.  You would need to either modify the source code or sort your
>> results first by file extension (ExtractPath) then by rank and show
>> your results in groups.
>>
> 
> For the short term, sorting the results by extension and then rank seems
> straightforward and quick to do. Will do this and check how good the
> results remain. Will look into the source if this is not satisfactory.
> 
> I also found that letting users pick categories with check boxes
> (multiple index files for different documentation) reduced the number
> of irrelevant results, though the original problem remains. But this
> approach does not bring out the quality of the extraordinary tool that
> swish is.
> 
> thanks.
> Shiv
> 
> 
> 
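
For the short-term approach quoted above (sort by extension first, then by 
rank), post-processing the result list could look something like the sketch 
below. It assumes you have already parsed the swish-e output into (rank, path) 
pairs; the grouping table and function name are hypothetical:

```python
import os

# Hypothetical grouping: lower number is shown first.
EXT_ORDER = {".html": 0, ".htm": 0, ".pdf": 1, ".doc": 1}

def sort_by_extension_then_rank(results, ext_order=EXT_ORDER):
    """Sort (rank, path) pairs by extension group, then by rank descending."""
    def key(item):
        rank, path = item
        ext = os.path.splitext(path)[1].lower()
        # Unknown extensions go after every known group.
        return (ext_order.get(ext, len(ext_order)), -rank)
    return sorted(results, key=key)
```

With this, a PDF ranked 1000 still lands below an html page ranked 640, which 
is the trade-off of grouping by container rather than by content.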

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Wed Jul 19 20:40:46 2006