Re: Calculating similarity index between html files

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sun Feb 06 2005 - 14:22:06 GMT
I suppose it depends on what you consider to be 'similar'.

<p>
The cat sat on the mat.
</p>

<p>
The mat sat on the cat.
</p>

From an indexing point of view, you might consider those 99% the same: same
words, different order.

From a semantic/logical point of view, they communicate something totally
different, yes?
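
To make that contrast concrete, here is a rough Python sketch (not anything
Swish-e does itself). It strips tags with a crude regex, then computes a
Jaccard score between 0 and 1 over word n-grams. The function names and the
choice of Jaccard are my own assumptions for illustration. With n=1 only the
vocabulary counts, so the two paragraphs look identical; with n=2 word order
starts to matter.

```python
import re

def tokens(html):
    """Crudely strip tags and return the lowercase words in order."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9']+", text.lower())

def shingles(words, n):
    """Set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=1):
    """Jaccard similarity of the n-word shingle sets: 1.0 = same, 0.0 = disjoint."""
    sa, sb = shingles(tokens(a), n), shingles(tokens(b), n)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)

doc1 = "<p>\nThe cat sat on the mat.\n</p>"
doc2 = "<p>\nThe mat sat on the cat.\n</p>"

print(similarity(doc1, doc2))       # 1.0, same words
print(similarity(doc1, doc2, n=2))  # about 0.67, order differs
```

Neither number says anything about meaning, of course; the second is just
more sensitive to word order than the first.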

One thing I might try would be to ignore words with a high index frequency,
i.e. what we normally consider StopWords. I would play with the IgnoreWords
config setting to try that out. That way you could separate the chaff (so to
speak) from the words that "matter".
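
Outside of Swish-e, the same idea can be sketched in a few lines of Python.
This is only an illustration under my own assumptions: the helper names, the
50% cutoff and the sample documents are all made up, and it does not reproduce
how IgnoreWords is implemented. Words that occur in more than half of the
documents are treated as chaff and dropped before comparing.

```python
import re
from collections import Counter

def words(html):
    """Crudely strip tags and return the lowercase words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9']+", text.lower())

def frequent_words(docs, max_fraction=0.5):
    """Words found in more than max_fraction of the documents (hypothetical cutoff)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(words(doc)))  # count each word once per document
    return {w for w, n in doc_freq.items() if n / len(docs) > max_fraction}

def similarity(a, b, ignore=frozenset()):
    """Jaccard similarity of the word sets, after removing ignored words."""
    sa = set(words(a)) - ignore
    sb = set(words(b)) - ignore
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

docs = [
    "<p>The cat sat on the mat.</p>",
    "<p>The dog sat on the rug.</p>",
    "<p>The bird sat on the wire.</p>",
]
ignore = frequent_words(docs)

print(sorted(ignore))                        # ['on', 'sat', 'the']
print(similarity(docs[0], docs[1]))          # about 0.43
print(similarity(docs[0], docs[1], ignore))  # 0.0
```

Without the filter the first two documents look fairly alike only because
they share "the", "sat" and "on"; once those are ignored, nothing in common
remains.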

mark@workzoo.com wrote on 2/6/05 1:31 AM:

> Hi,
> 
> Is there a way to algorithmically calculate the similarity between two
> chunks of html as some sort of index? Perhaps a float value between 0 and 1
> where 1 is exactly the same and 0 is 100% different? I'm trying to remove
> very similar documents from our swish index.
> 
> I'd really appreciate any help you can offer because I've been struggling
> with this for some time.
> 
> Thanks,
> 
> Mark.

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sun Feb 6 06:22:12 2005