Mark Maunder wrote on 2/6/05 11:49 AM:
> An interesting feature in swish might be to have a config option to
> remove duplicates while indexing. The implementation might calculate the
> Levenshtein distance of each field added to every other field that has a
> set of predefined MetaNames equal. In other words, it only calculates
> the LD for all docs that have the same title and base url, for example.
> Then it only preserves the most recent document of the duplicates.
That's an interesting theory to play with. I'll have to look into it more. Some
derivation might be useful for a ranking scheme.
However, I think it's beyond the bounds of Swish-e's mission to include that
kind of feature on the indexing side. Swish-e does one thing well: index and
search files. The more features we add to it, the less likely it is to do
its main job quickly. Judging from the number of emails on this list about
folks using Swish-e to index million+ docs, I think it's already being stretched
beyond its original intentions. I'm waiting for the email that says, "I'm using
Swish-e to index a billion docs and my machine started dancing around on the
table and smoking like a chimney!"
If you're using -S prog (which spider.pl does, IIRC), then that sounds like a
perfect candidate for a hook or callback that compares docs before passing them
on to Swish-e to index.
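A minimal sketch of what such a pre-index hook might look like. It is in Python
rather than Perl and is not taken from spider.pl. It buckets docs by title and
base URL, compares bodies within each bucket by Levenshtein distance, and keeps
only the most recent doc of each near-duplicate set. The dict keys (url, title,
mtime, body), the function names, and the 10% threshold are all made up for
illustration. The Path-Name / Content-Length / Last-Mtime headers are the
-S prog record format as I understand it from the Swish-e docs, so check them
against your version.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def base_url(url: str) -> str:
    """Scheme plus host, e.g. 'http://example.com' (illustrative definition)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def is_duplicate(a: dict, b: dict, max_ratio: float) -> bool:
    """True when the bodies differ by at most max_ratio of the longer one."""
    longest = max(len(a["body"]), len(b["body"]))
    return levenshtein(a["body"], b["body"]) <= max_ratio * longest


def dedupe(docs, max_ratio: float = 0.1) -> list:
    """Keep the newest doc of each near-duplicate set.

    Only docs sharing the same title and base URL are ever compared.
    The doc layout and the threshold are assumptions, not Swish-e behaviour.
    """
    buckets = defaultdict(list)
    for doc in docs:
        buckets[(doc["title"], base_url(doc["url"]))].append(doc)

    kept = []
    for group in buckets.values():
        group.sort(key=lambda d: d["mtime"], reverse=True)  # newest first
        survivors = []
        for doc in group:
            # An older doc is dropped if it is close to a newer survivor.
            if not any(is_duplicate(doc, s, max_ratio) for s in survivors):
                survivors.append(doc)
        kept.extend(survivors)
    return kept


def prog_record(doc: dict) -> bytes:
    """Format one doc as a -S prog style record (headers, blank line, body)."""
    body = doc["body"].encode("utf-8")
    headers = (
        f"Path-Name: {doc['url']}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Last-Mtime: {doc['mtime']}\n"
        "\n"
    )
    return headers.encode("utf-8") + body
```

The surviving docs could then be written to stdout with something like
sys.stdout.buffer.write(prog_record(doc)) for each doc returned by dedupe().
Note that the distance calculation is quadratic in document length for every
pair compared, so on a large collection this filter, not the indexer, would
likely be the slow part.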
IMHO, Swish-e should handle whatever you hand to it, quickly, at least up to a
(as yet undefined?) scale. What you hand to it, using whatever algorithms you
might devise, can (and should?) vary by application.
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Tue Feb 8 06:05:06 2005