Re: Calculating similarity index between html files

From: Mark Maunder <mark(at)not-real.workzoo.com>
Date: Tue Feb 08 2005 - 16:53:02 GMT
Peter,

Agreed on all points. I've managed to get a user-defined Levenshtein
distance (edit distance) function working under MySQL. The source is
here:
http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/
It slows down noticeably as the string length increases (the usual
algorithm takes time proportional to the product of the two string
lengths). 400 seems to be the optimal length for comparisons. I plan to
pre-process the source data in MySQL before passing it to swish via
-S prog as you described.
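
For readers who want to try the comparison outside MySQL, here is a
minimal sketch of the edit distance calculation in Python. It is not the
UDF linked above, just the textbook dynamic-programming approach. The
names levenshtein, capped_distance and MAX_LEN are made up for this
example; the 400 cap is simply the figure mentioned in this message.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: inserts, deletes and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


MAX_LEN = 400  # assumption: cap taken from the message, tune for your data


def capped_distance(a: str, b: str, max_len: int = MAX_LEN) -> int:
    """Compare only the first max_len characters to bound the run time."""
    return levenshtein(a[:max_len], b[:max_len])
```

Because the work grows with the product of the two lengths, capping both
strings keeps each comparison to a bounded cost, at the price of ignoring
differences past the cap.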

Mark.

On Tue, 2005-02-08 at 06:04 -0800, Peter Karman wrote:
> 
> Mark Maunder wrote on 2/6/05 11:49 AM:
> 
> > An interesting feature in swish might be to have a config option to
> > remove duplicates while indexing. The implementation might calculate the
> > levenshtein distance of each field added to every other field that has a
> > set of predefined MetaNames equal. In other words, it only calculates
> > the LD for all docs that have the same title and base url, for example.
> > Then it only preserves the most recent document of the duplicates.
> 
> That's an interesting theory to play with. I'll have to look into it more. Some 
> derivation might be useful for a ranking scheme.
> 
> However, I think it's beyond the bounds of Swish-e's mission to include that 
> kind of feature on the indexing side. Swish-e does one thing well: index and 
> search files. The more features we add to it, the less likely it will be to do 
> its main job, quickly. Judging from the number of emails on this list about 
> folks using Swish-e to index million+ docs, I think it's already being stretched 
> beyond its original intentions. I'm waiting for the email that says, "I'm using 
> Swish-e to index a billion docs and my machine started dancing around on the 
> table and smoking like a chimney!"
> 
> If you're using -S prog (which spider.pl does, IIRC), then that sounds like a 
> perfect candidate for a hook or callback to compare docs before passing on to 
> Swish-e to index.
> 
> IMHO, Swish-e should handle whatever you hand to it, quickly, at least up to a 
> (as yet undefined?) scale. What you hand to it, using whatever algorithms you 
> might devise, can (and should?) vary in the application.
> 
> 
> 
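
The duplicate-removal idea quoted above, done as a filter before the
documents reach Swish-e, could be sketched roughly as follows. This is
only an illustration under assumptions: the dictionary keys (title,
base_url, mtime, body), the function names and the max_distance
threshold are all hypothetical, and nothing here is part of Swish-e or
spider.pl.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Plain Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def drop_near_duplicates(docs, max_distance=20, max_len=400):
    """Keep only the newest document of each near-duplicate cluster.

    Only documents sharing the same (title, base_url) are compared,
    so unrelated documents never pay for a distance calculation.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["title"], doc["base_url"])].append(doc)

    kept = []
    for group in groups.values():
        group.sort(key=lambda d: d["mtime"], reverse=True)  # newest first
        survivors = []
        for doc in group:
            body = doc["body"][:max_len]
            # Keep the doc only if it is far from every newer survivor.
            if all(edit_distance(body, s["body"][:max_len]) > max_distance
                   for s in survivors):
                survivors.append(doc)
        kept.extend(survivors)
    return kept
```

The surviving documents would then be written out by whatever program is
feeding Swish-e through -S prog.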
Received on Tue Feb 8 08:53:05 2005