Re: Calculating similarity index between html files

From: Mark Maunder <mark(at)>
Date: Tue Feb 08 2005 - 16:53:02 GMT

Agreed on all points. I've managed to get a user defined levenshtein
distance (edit distance) function working under mysql. The source is
It slows down quickly as the string length increases (the cost grows with
the product of the two string lengths). 400 seems
to be the optimal length for comparisons. I plan to pre-process the source
data in mysql before passing it to swish via prog as you described. 
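
For anyone who wants to try the comparison outside of mysql, here is a
minimal sketch of the same idea in Python. It is not the mysql function
mentioned above, just the textbook dynamic-programming edit distance. The
names levenshtein, similarity and MAX_LEN are made up for illustration, and
the 400-character cap is simply the figure quoted above, applied here by
truncating both inputs.

```python
MAX_LEN = 400  # assumption: cap taken from the figure mentioned in the message


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str, max_len: int = MAX_LEN) -> float:
    """Similarity in [0, 1]: 1.0 means identical after truncation to max_len."""
    a, b = a[:max_len], b[:max_len]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Because the table has one cell per pair of characters, truncating both
strings bounds the work per comparison, at the price of ignoring any
differences past the cut-off.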


On Tue, 2005-02-08 at 06:04 -0800, Peter Karman wrote:
> Mark Maunder wrote on 2/6/05 11:49 AM:
> > An interesting feature in swish might be to have a config option to
> > remove duplicates while indexing. The implementation might calculate the
> > levenshtein distance of each field added to every other field that has a
> > set of predefined MetaNames equal. In other words, it only calculates
> > the LD for all docs that have the same title and base url, for example.
> > Then it only preserves the most recent document of the duplicates.
> That's an interesting theory to play with. I'll have to look into it more. Some 
> derivation might be useful for a ranking scheme.
> However, I think it's beyond the bounds of Swish-e's mission to include that 
> kind of feature on the indexing side. Swish-e does one thing well: index and 
> search files. The more features we add to it, the less likely it will be to do 
> its main job, quickly. Judging from the number of emails on this list about 
> folks using Swish-e to index million+ docs, I think it's already being stretched 
> beyond its original intentions. I'm waiting for the email that says, "I'm using 
> Swish-e to index a billion docs and my machine started dancing around on the 
> table and smoking like a chimney!"
> If you're using -S prog (which does, IIRC), then that sounds like a 
> perfect candidate for a hook or callback to compare docs before passing on to 
> Swish-e to index.
> IMHO, Swish-e should handle whatever you hand to it, quickly, at least up to a 
> (as yet undefined?) scale. What you hand to it, using whatever algorithms you 
> might devise, can (and should?) vary in the application.
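
A rough sketch of the pre-filter discussed above, written in Python: group
documents that share a title and base URL, then keep only the most recent
of any near-duplicates before handing the rest to the indexer. Everything
here is an assumption made for illustration: the record fields (title,
base_url, mtime, body), the function name drop_near_duplicates, the
threshold of 20 edits and the 400-character cap. The distance argument is
any edit-distance function you supply. Nothing in it is part of Swish-e.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Doc = Dict[str, Any]  # hypothetical record: title, base_url, mtime, body


def drop_near_duplicates(
    docs: Iterable[Doc],
    distance: Callable[[str, str], int],
    threshold: int = 20,   # assumption: max edits to still count as a duplicate
    max_len: int = 400,    # assumption: compare only the first max_len characters
) -> List[Doc]:
    """Return docs with older near-duplicates removed.

    Only documents with the same (title, base_url) are compared, so the
    distance function is never called across unrelated groups.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["title"], doc["base_url"])].append(doc)

    kept: List[Doc] = []
    for group in groups.values():
        group.sort(key=lambda d: d["mtime"], reverse=True)  # newest first
        survivors: List[Doc] = []
        for doc in group:
            body = doc["body"][:max_len]
            # Keep the doc only if it differs enough from every newer survivor.
            if all(distance(body, s["body"][:max_len]) > threshold
                   for s in survivors):
                survivors.append(doc)
        kept.extend(survivors)
    return kept
```

Within a group this still compares each document against every survivor,
so restricting the comparison to documents with matching fields is what
keeps the number of distance calls manageable.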
Received on Tue Feb 8 08:53:05 2005