Re: Ranking order based on file extension

From: Shivakumar GN <shivakumar.gn(at)not-real.gmail.com>
Date: Sat Jul 22 2006 - 03:26:14 GMT
On Wed, 2006-07-19 at 20:40 -0700, Peter Karman wrote:
> what you are describing is a situation I have had before. It's not a Swish-e 
> problem; it's a document management problem.
> 
> The solution is actually quite simple: only index one format of each document, 
> then offer the user the opportunity to read it in alternate formats.
> 

Whether the issue is best addressed through document management or
through extensions to the tool depends on the situation. The
situation I mentioned is one where I would definitely prefer a
solution in the tool (either swish or perl scripts).

Mine is a documentation repository that I have created by dumping
documentation from all the OEMs that our product depends upon. The
documentation changes and grows with each new release from the
OEMs. I dump the newer set of documentation and rebuild the indexes.
I would prefer not to do "document management", since it would become an
ongoing effort. Instead I would prefer that either
[1] swish provides a config facility, or
[2] I adapt the perl backend to get acceptable functionality.
That way it will be a one-time effort.
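
To make option [2] concrete, here is a minimal sketch of a post-processing step that re-orders search hits by file extension. It is written in Python rather than perl purely for illustration. The weight table, the function name `rerank`, and the assumption that hits arrive as (rank, path) pairs are all hypothetical; nothing here is a swish feature.

```python
import os

# Hypothetical weights: below 1.0 demotes a format, above 1.0 promotes it.
EXT_WEIGHT = {".pdf": 1.0, ".html": 0.8, ".htm": 0.8}


def rerank(results, weights=EXT_WEIGHT, default=1.0):
    """Re-order (rank, path) pairs by rank scaled with a per-extension weight.

    The original rank values are returned unchanged; only the order differs.
    """
    def adjusted(item):
        rank, path = item
        ext = os.path.splitext(path)[1].lower()
        return rank * weights.get(ext, default)

    # sorted() is stable, so hits with equal adjusted rank keep their order.
    return sorted(results, key=adjusted, reverse=True)


if __name__ == "__main__":
    hits = [
        (1000, "docs/guide/ch1.html"),
        (900, "docs/guide.pdf"),
        (500, "docs/notes.txt"),
    ]
    for rank, path in rerank(hits):
        print(rank, path)
```

With these weights the PDF hit moves ahead of the higher-ranked HTML fragment; the weights would need tuning against real results.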

> Since you don't manage the document collection, it can be a bit harder to 
> achieve that solution, since the collection may not be organized in such a way 
> that makes it easy to tell if you are looking at multiple formats of the same 
> document. Here's one solution someone found:
> 
> http://swish-e.org/archive/2005-02/8998.html
> 
Interesting. In the kind of documentation I mentioned, often
1 PDF = multiple HTML files.
It would be interesting to know if there is an algorithm that can
identify duplicates given a repository of information. Comparing 2
documents is one thing, but identifying duplication, especially when the
duplication is fragmented across files, is another. The post talks about
using indexes. Will check it further.
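
One common technique for the fragmented case is word shingling with a containment score: what fraction of a small file's word sequences also appear in a large one. The sketch below is a generic illustration, not the method from the linked post. It assumes plain text has already been extracted from the PDF and HTML files, and the function names and the shingle size `k` are arbitrary choices.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word sequences (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle, or none if there are no words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(fragment, whole, k=5):
    """Fraction of the fragment's shingles that also occur in whole (0.0-1.0)."""
    frag = shingles(fragment, k)
    if not frag:
        return 0.0
    return len(frag & shingles(whole, k)) / len(frag)


if __name__ == "__main__":
    pdf_text = ("Chapter one installing the driver requires root access. "
                "Chapter two configuring the network interface is optional.")
    html_page = "Installing the driver requires root access."
    print(containment(html_page, pdf_text, k=3))
```

A score near 1.0 for an HTML page against a PDF suggests the page is a fragment of that PDF. Comparing every pair of files this way gets slow on a large repository, so some form of indexing would still be needed.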

> As for a single html document being split across several files, I had the same 
> situation. My solution was to create a virtual composite of those html files 
> (order was irrelevant) and feed it to swish-e -S prog as a single document. That way 
> the PDF vs HTML issue was moot.
> 
From the swish documentation I got the impression that filters and -S
prog basically achieve the same thing. Am I missing something here? I
haven't experimented with all possible options of swish. Will check it
out further.
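
For reference, a rough sketch of what such a composite might look like when fed through -S prog, where the external program itself decides what counts as one document. It joins several HTML files into one body and writes a header block (Path-Name, Content-Length, optional Last-Mtime, Document-Type) followed by a blank line and the content. The function names, the composite path name, and the choice of "HTML*" as the document type are assumptions for illustration; check the header details against the swish documentation before relying on them.

```python
import sys


def composite_record(path_name, parts, mtime=None):
    """Build one prog-style record (headers, blank line, body) as bytes.

    parts is a list of bytes objects, one per source HTML file.
    """
    body = b"\n".join(parts)
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(body)}",  # length of the body in bytes
    ]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    headers.append("Document-Type: HTML*")  # assumed parser name
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


def emit(path_name, filenames):
    """Read the given files and write one composite record to stdout."""
    parts = []
    for name in filenames:
        with open(name, "rb") as fh:
            parts.append(fh.read())
    sys.stdout.buffer.write(composite_record(path_name, parts))


if __name__ == "__main__":
    # Usage (hypothetical): script.py composite-name file1.html file2.html ...
    if len(sys.argv) > 2:
        emit(sys.argv[1], sys.argv[2:])
```

Simply joining complete HTML files yields several html and body elements in one stream, so whether the parser copes with that would need checking. The files on disk stay where they are, so cross-references between them are not affected.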

(As a side note, changing the OEM doc's directory structure is not
always nice since the documentation will/may have many cross-references
and changing the dir structure will invalidate many of the
cross-references/links)

thanks and best
Shivakumar
Received on Fri Jul 21 20:26:23 2006