Re: Calculating similarity index between html files

From: Peter Karman <peter(at)>
Date: Sun Feb 06 2005 - 14:22:06 GMT

I suppose it depends on what you consider to be 'similar'.

The cat sat on the mat.

The mat sat on the cat.

From an indexing point of view, you might consider those 99% the same: same
words, different order.

From a semantic/logical point of view, they communicate something totally
different, yes?
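
To make the contrast concrete, here is a rough Python sketch (not anything swish-e does internally). The helper names `words`, `bigrams` and `jaccard` are made up for this example. It scores the two sentences once as plain sets of words and once as sets of adjacent word pairs, so word order starts to count in the second score:

```python
def words(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    stripped = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in stripped if w]


def bigrams(tokens):
    """Adjacent word pairs, a simple way to let word order matter."""
    return list(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty inputs are treated as identical
    return len(a & b) / len(a | b)


s1 = "The cat sat on the mat."
s2 = "The mat sat on the cat."

# Same set of words, so the word-level score is as high as it can get.
bag_score = jaccard(words(s1), words(s2))

# Only some of the adjacent pairs match, so this score is lower.
order_score = jaccard(bigrams(words(s1)), bigrams(words(s2)))

print(bag_score, order_score)
```

The word-set score treats the two sentences as identical, while the word-pair score does not. Neither one knows anything about meaning; the pairs only capture local word order.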

One thing I might try would be to ignore words with a high index frequency, what
we normally consider StopWords. I would play with the IgnoreWords config setting
to try that out. That way you could separate the chaff (so to speak) from the
words that "matter".

wrote on 2/6/05 1:31 AM:

> Hi,
> Is there a way to algorithmically calculate the similarity between two
> chunks of html as some sort of index? Perhaps a float value between 0 and 1
> where 1 is exactly the same and 0 is 100% different? I'm trying to remove
> very similar documents from our swish index.
> I'd really appreciate any help you can offer because I've been struggling
> with this for some time.
> Thanks,
> Mark.
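
For the 0-to-1 float asked for above, here is a minimal Python sketch that works outside of swish-e, under some loud assumptions: the `STOPWORDS` set is a made-up placeholder (a real list would come from the high-frequency words in your own index), the function names are hypothetical, and cosine similarity over word counts is just one possible definition of "similar". It ignores markup and word order entirely.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Placeholder stop list for illustration only (an assumption, not swish-e's list).
STOPWORDS = {"the", "a", "an", "on", "of", "and", "to", "in", "is"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML chunk and ignore the tags.

    Note: text inside <script> and <style> is collected too; filter it
    separately if your documents contain any.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def term_counts(html):
    """Count the words of an HTML chunk, leaving out the stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def similarity(html_a, html_b):
    """Cosine similarity of word counts: 1.0 = same word mix, 0.0 = no shared words."""
    a, b = term_counts(html_a), term_counts(html_b)
    if not a or not b:
        return 1.0 if a == b else 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Clamp to guard against floating-point results slightly above 1.0.
    return min(1.0, dot / (norm_a * norm_b))


if __name__ == "__main__":
    print(similarity("<p>The cat sat on the mat.</p>",
                     "<div>The mat sat on the <b>cat</b>.</div>"))
```

You would then pick a threshold by experiment and treat any pair of documents scoring above it as duplicates before indexing.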

Peter Karman  .  .  peter(at)
Received on Sun Feb 6 06:22:12 2005