Re: Calculating similarity index between html files

From: Mark Maunder <mark(at)not-real.workzoo.com>
Date: Sun Feb 06 2005 - 17:50:06 GMT

Thanks Peter,

I've discovered the Levenshtein distance algorithm, which appears to do
what I need. It calculates the number of single-character 'edits'
(insertions, deletions and substitutions) required to convert one string
into another. I do agree that Levenshtein distance and linguistic
meaning are not directly proportional.
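
A minimal sketch of that calculation, in Python rather than the Perl
modules mentioned here. The function names levenshtein() and
similarity() are made up for illustration, and dividing by the longer
string's length is just one assumed way to get the 0-to-1 score asked
about in the original question.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 when every
    position has to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

The running time grows with len(a) * len(b), so whole HTML pages can be
slow to compare. The two cat/mat sentences quoted below are two
substitutions apart, so they score about 0.91 despite meaning different
things.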

Unfortunately, the pure-Perl implementation of Levenshtein hangs and the
XS (faster) implementation segfaults. I'm experimenting with a MySQL
user-defined function to parse the blocks of text before they're indexed
by swish.

An interesting feature in swish might be a config option to remove
duplicates while indexing. The implementation might calculate the
Levenshtein distance between each document being added and every other
document whose values for a predefined set of MetaNames are equal. In
other words, it only calculates the LD between docs that have the same
title and base url, for example. Of each set of duplicates it then
preserves only the most recent document.
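
A rough sketch of that idea in Python. Nothing here is swish's API: the
Doc record, its field names, dedupe() and the max_distance threshold
are all hypothetical, chosen only to show the grouping and the
keep-the-newest rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical record; title and base_url stand in for the MetaNames.
    title: str
    base_url: str
    body: str
    modified: float  # timestamp, larger means more recent


def levenshtein(a: str, b: str) -> int:
    """Edit distance using a two-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def dedupe(docs, max_distance=10):
    """Drop near-duplicates, comparing only docs with equal key fields."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc.title, doc.base_url)].append(doc)

    kept = []
    for group in groups.values():
        # Newest first, so the most recent copy of a duplicate survives.
        group.sort(key=lambda d: d.modified, reverse=True)
        survivors = []
        for doc in group:
            # Keep a doc only if it is far from everything already kept.
            if all(levenshtein(doc.body, s.body) > max_distance
                   for s in survivors):
                survivors.append(doc)
        kept.extend(survivors)
    return kept
```

Grouping on the key fields first keeps the number of pairwise
comparisons down, but each group is still compared pair by pair, so
large groups of long documents would be expensive.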

On Sun, 2005-02-06 at 08:20 -0600, Peter Karman wrote:
> I suppose it depends on what you consider to be 'similar'.
> 
> <p>
> The cat sat on the mat.
> </p>
> 
> <p>
> The mat sat on the cat.
> </p>
> 
> From an indexing point of view, you might consider those 99% the same: same
> words, different order.
> 
> From a semantic/logical point of view, they communicate something totally
> different, yes?
> 
> One thing I might try would be to ignore words with a high index frequency,
> what we normally consider stopwords. I would play with the IgnoreWords config
> setting to try that out. That way you could separate the chaff (so to speak)
> from the words that "matter".
> 
> mark@workzoo.com wrote on 2/6/05 1:31 AM:
> 
> > Hi,
> > 
> > Is there a way to algorithmically calculate the similarity between two
> > chunks of HTML as some sort of index? Perhaps a float value between 0 and 1,
> > where 1 is exactly the same and 0 is 100% different? I'm trying to remove
> > very similar documents from our swish index.
> > 
> > I'd really appreciate any help you can offer because I've been struggling
> > with this for some time.
> > 
> > Thanks,
> > 
> > Mark.
> 
Received on Sun Feb 6 09:50:13 2005