RE: html-entities

From: Bill Moseley <moseley(at)>
Date: Wed Sep 19 2001 - 06:19:42 GMT
At 01:11 AM 09/18/01 -0700, wrote:
>Swish is returning the entities as-is in the documents.
>The normalization of the entities is only done
>for storing and searching in swish itself.

I thought we had some debate about this, Rainer.  Specifically, I thought
that I had changed to decoding entities when storing properties (including
title).  My memory isn't so clear on the issue, but I think it had
something to do with properties sorting correctly.

Anyway, I actually think that's a bug.  With <meta> properties, the
entities are converted.  It's just that the HTML title extraction code
fails to convert the entities.
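
As a rough illustration of why decoding matters for sorting, here is a small
Python sketch (not swish-e code; normalize_property and the sample titles are
made up for the example).  It assumes a property is simply run through an
entity decoder before it is stored:

```python
import html

def normalize_property(raw):
    """Hypothetical helper: decode HTML entities before storing a property."""
    return html.unescape(raw)

# Made-up titles as they might appear inside <title> in the source documents.
titles = ["Zebra facts", "&Eacute;clair recipes", "Apple &amp; pear"]

# Sorting the raw strings puts the entity-encoded title first, because
# "&" compares lower than any letter.
raw_sorted = sorted(titles)

# Sorting the decoded strings orders them by the characters a reader sees
# (plain code-point order here, so the accented title lands last).
decoded_sorted = sorted(normalize_property(t) for t in titles)
```

With the raw strings the "&Eacute;clair" title sorts ahead of "Apple"; once
decoded it no longer does, which is the kind of difference I remember being
the issue.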

If/when we use libxml for parsing HTML, entities will be decoded when
parsing, so both the indexed words and the properties would have entities decoded.
I would think that would be the correct behavior for <title> and <meta>
content stored as a property.  Can you think of situations where that would
not be the correct behavior?

We may want to add an escapeHTML feature to the -x format.
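
A minimal sketch of what such an option might do, again in Python rather than
swish-e's C.  The render_result function, the escape_html flag, and the
{name} placeholder syntax are all assumptions for illustration and are not
the real -x format syntax:

```python
import html

def render_result(fmt, props, escape_html=False):
    """Hypothetical output formatter: fill {name} placeholders from props.

    When escape_html is true, property values are re-escaped so that decoded
    text such as "&" or "<" is safe to drop into an HTML results page.
    """
    values = {
        name: html.escape(value) if escape_html else value
        for name, value in props.items()
    }
    return fmt.format(**values)

# Properties as they would be stored after entity decoding.
props = {"title": "Tom & Jerry <uncut>", "path": "/docs/tj.html"}

plain = render_result("{path}\t{title}", props)
escaped = render_result('<a href="{path}">{title}</a>', props, escape_html=True)
```

The idea is that properties are stored decoded, and escaping is applied only
on output when the caller asks for it.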

Note that there's a hack in the current html parser that allows something like:

   <meta name="metaname" content="this is \<b\>bold\</b\>">

That, of course, would not work with the libxml parser.  I don't think that
was ever documented, so it's probably not used much.  (Roy, do you have
many indexes that use that escape method?)
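
To show why the backslash form is parser-specific, here is a sketch using
Python's html.parser as a stand-in for a standards-following parser (it is
not libxml, and MetaCollector and meta_content are hypothetical names).  A
parser like this gives backslashes no special meaning, so they stay in the
value, while the entity form decodes to the markup characters:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.meta[attr["name"]] = attr["content"]

def meta_content(doc):
    """Parse doc and return a dict of meta name -> content."""
    parser = MetaCollector()
    parser.feed(doc)
    parser.close()
    return parser.meta

# Backslash-escaped form: the backslashes are kept literally.
hack = meta_content(r'<meta name="metaname" content="this is \<b\>bold\</b\>">')

# Entity form: the parser decodes the entities in the attribute value.
std = meta_content('<meta name="metaname" content="this is &lt;b&gt;bold&lt;/b&gt;">')
```

So documents relying on the backslash form would end up with stray
backslashes in the property, whereas the entity form gives the intended
value.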

Bill Moseley
