Peter,
Agreed on all points. I've managed to get a user-defined Levenshtein
distance (edit distance) function working under MySQL. The source is
here:
http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/
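It slows down roughly quadratically as the string length increases
(assuming the standard dynamic-programming algorithm, which is O(m*n) in
the two string lengths). 400 seems to be the optimal length for
comparisons. I plan to pre-process the source data in MySQL before
passing it to Swish-e via -S prog as you described.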
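
For illustration only, here is a minimal Python sketch of that
pre-processing idea. It is not the UDF source (which lives at the URL
above) and not necessarily how the MySQL side will end up working. The
field names title, base_url, body and mtime, the function names, and the
max_distance threshold are all hypothetical; the 400 cut-off just echoes
the length mentioned above.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def drop_near_duplicates(docs, max_distance=10, compare_length=400):
    """Keep the most recent document out of each set of near-duplicates.

    docs is an iterable of dicts with the hypothetical keys
    title, base_url, body and mtime. Only documents sharing the same
    (title, base_url) pair are compared with each other.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["title"], doc["base_url"])].append(doc)

    kept = []
    for group in groups.values():
        survivors = []
        # Newest first, so the most recent copy of a duplicate is the one kept.
        for doc in sorted(group, key=lambda d: d["mtime"], reverse=True):
            head = doc["body"][:compare_length]
            if all(levenshtein(head, s["body"][:compare_length]) > max_distance
                   for s in survivors):
                survivors.append(doc)
        kept.extend(survivors)
    return kept
```

The surviving documents could then be written out by whatever program
-S prog runs. Truncating to compare_length keeps each comparison cheap,
at the cost of ignoring differences further into the text.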
Mark.
On Tue, 2005-02-08 at 06:04 -0800, Peter Karman wrote:
>
> Mark Maunder wrote on 2/6/05 11:49 AM:
>
> > An interesting feature in swish might be to have a config option to
> > remove duplicates while indexing. The implementation might calculate the
> > levenshtein distance of each field added to every other field that has a
> > set of predefined MetaNames equal. In other words, it only calculates
> > the LD for all docs that have the same title and base url, for example.
> > Then it only preserves the most recent document of the duplicates.
>
> That's an interesting theory to play with. I'll have to look into it more. Some
> derivation might be useful for a ranking scheme.
>
> However, I think it's beyond the bounds of Swish-e's mission to include that
> kind of feature on the indexing side. Swish-e does one thing well: index and
> search files. The more features we add to it, the less likely it will be to do
> its main job, quickly. Judging from the number of emails on this list about
> folks using Swish-e to index million+ docs, I think it's already being stretched
> beyond its original intentions. I'm waiting for the email that says, "I'm using
> Swish-e to index a billion docs and my machine started dancing around on the
> table and smoking like a chimney!"
>
> If you're using -S prog (which spider.pl does, IIRC), then that sounds like a
> perfect candidate for a hook or callback to compare docs before passing on to
> Swish-e to index.
>
> IMHO, Swish-e should handle whatever you hand to it, quickly, at least up to a
> (as yet undefined?) scale. What you hand to it, using whatever algorithms you
> might devise, can (and should?) vary in the application.
>
Received on Tue Feb 8 08:53:05 2005