Re: Ranking order based on file extension

From: Shivakumar GN
Date: Sat Jul 22 2006 - 03:26:14 GMT
On Wed, 2006-07-19 at 20:40 -0700, Peter Karman wrote:
> what you are describing is a situation I have had before. It's not a Swish-e 
> problem; it's a document management problem.
> The solution is actually quite simple: only index one format of each document, 
> then offer the user the opportunity to read it in alternate formats.

Whether the issue is better addressed through document management or
through extensions to the tool depends on the situation. The
situation I described is one where I would definitely prefer a
solution via the tool (either swish or perl scripts).
Mine is a documentation repository that I created by dumping
documentation from all the OEMs that our product depends upon. The
documentation changes and grows as the OEMs release new versions.
Each time, I dump the newer set of documentation and rebuild the indexes.
I would prefer not to do "document management" since it will become an
ongoing effort. Instead I would prefer that either
[1] swish provides a config facility, or
[2] I adapt the perl backend to get acceptable functionality.
This way it will be a one-time effort.
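
A rough sketch of what option [2] could look like, written in Python rather than perl for brevity. It assumes the search backend hands back hits as (rank, path) pairs; the `EXT_PRIORITY` table, the `rerank` name, and the sample paths are all made up for illustration and are not part of swish.

```python
import os

# Hypothetical preference table: a lower number means "show this format first".
EXT_PRIORITY = {".html": 0, ".htm": 0, ".pdf": 1}

def rerank(results, ext_priority=EXT_PRIORITY):
    """Sort (rank, path) hits by extension preference, then by descending rank."""
    def key(hit):
        rank, path = hit
        ext = os.path.splitext(path)[1].lower()
        # Extensions not in the table sort after all listed ones.
        return (ext_priority.get(ext, len(ext_priority)), -rank)
    return sorted(results, key=key)

# Made-up hits, as a backend might return them.
hits = [
    (1000, "oem/guide.pdf"),
    (870, "oem/guide/ch1.html"),
    (640, "oem/notes.txt"),
]
for rank, path in rerank(hits):
    print(rank, path)
```

Note that this puts every preferred-format hit ahead of the others regardless of score; a gentler variant would only use the extension to break ties between hits of similar rank.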

> Since you don't manage the document collection, it can be a bit harder to 
> achieve that solution, since the collection may not be organized in such a way 
> that makes it easy to tell if you are looking at multiple formats of the same 
> document. Here's one solution someone found:
Interesting. Often in the kind of documentation I mentioned,
1 PDF = multiple HTML files.
It would be interesting to know if there is an algorithm that can
identify duplicates in a given repository of information. Comparing 2
documents is one thing, but identifying duplication, especially when the
duplication is fragmented across files, is another. The post talks about
using indexes. I will check it further.
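
One known technique for the fragmented case is word shingling with a containment score: what fraction of a small file's k-word sequences also appear in the large one. The sketch below is only an illustration of that idea, not the method from the referenced post; the function names and the choice of k=5 are assumptions, and it expects plain text that has already been extracted from the PDF and HTML files.

```python
import re

def shingles(text, k=5):
    """Return the set of all k-word sequences in the text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(fragment, whole, k=5):
    """Fraction of the fragment's shingles that also occur in the whole text."""
    frag = shingles(fragment, k)
    if not frag:
        return 0.0  # fragment shorter than k words: nothing to compare
    return len(frag & shingles(whole, k)) / len(frag)
```

An HTML chapter that is a slice of the PDF would score close to 1.0 against the PDF text, while an unrelated file would score close to 0.0. Where to put the threshold would have to be tuned on the actual repository.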

> As for a single html document being split across several files, I had the same 
> situation. My solution was to create a virtual composite of those html files 
> (order was irrelevant) and feed it to swish-e -S prog as a single document. That way 
> the PDF vs HTML issue was moot.
From the swish documentation I got the impression that filters and -S
prog basically achieve the same thing. Am I missing something here? I
haven't experimented with all possible options of swish. Will check it
out further.
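
For reference, the "virtual composite" idea from the quoted text could look roughly like the sketch below. One difference from a filter is that a filter converts one file into one document, whereas with -S prog the external program decides what counts as a document, so several files can be emitted as one. The sketch assumes the prog input format as I understand it: a `Path-Name:` and a `Content-Length:` header (length in bytes), a blank line, then the content. The helper names and the virtual path are hypothetical, and the header details should be checked against the swish-e documentation before relying on this.

```python
import sys

def prog_record(virtual_path, parts):
    """Build one -S prog style record from several chunks of bytes."""
    body = b"\n".join(parts)
    header = (
        f"Path-Name: {virtual_path}\n"
        f"Content-Length: {len(body)}\n"  # byte count of the body
        "\n"                              # blank line ends the headers
    ).encode("utf-8")
    return header + body

def emit_composite(virtual_path, filenames, out=None):
    """Read the given files and write them out as a single composite record."""
    if out is None:
        out = sys.stdout.buffer
    parts = []
    for name in filenames:
        with open(name, "rb") as fh:
            parts.append(fh.read())
    out.write(prog_record(virtual_path, parts))
```

A script calling `emit_composite` once per group of HTML files would then be named as the program for swish-e to run under -S prog. Simply concatenating whole HTML files leaves several `<html>` elements in one document, so a real version would probably strip or merge those first.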

(As a side note, changing the OEM doc's directory structure is not
always nice since the documentation will/may have many cross-references
and changing the dir structure will invalidate many of the

thanks and best
Received on Fri Jul 21 20:26:23 2006