At 01:11 AM 09/18/01 -0700, Rainer.Scherg@rexroth.de wrote:
>Swish is returning the entities as-is in the documents.
>The normalization of the entities is only done
>for storing and searching in swish itself.
I thought we had some debate about this, Rainer. Specifically, I thought
that I had changed the code to decode entities when storing properties
(including title). My memory isn't so clear on the issue, but I think it
had something to do with properties sorting correctly.
Anyway, I actually think that's a bug. With <meta> properties, the
entities are converted. It's just that the HTML title extraction code
fails to convert the entities.
If/when we use libxml for parsing HTML, entities will be decoded during
parsing, so both indexed words and properties would have entities decoded.
I would think that would be the correct behavior for <title> and <meta>
content stored as a property. Can you think of situations where that would
not be the correct behavior?
We may want to add an escapeHTML feature to the -x format.
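
Something along these lines is what I have in mind, sketched in Python
rather than swish-e code. The function format_result and its escape_html
flag are hypothetical; no such option exists in the -x format today:

```python
from html import escape


def format_result(path: str, title: str, escape_html: bool = False) -> str:
    """Hypothetical output formatter: optionally re-escape a decoded property."""
    if escape_html:
        # Turn &, <, > and quotes back into entities so the value can be
        # dropped into an HTML results page safely.
        title = escape(title, quote=True)
    return f"{path}\t{title}"


# The property is stored decoded; escaping happens only on output.
plain = format_result("/docs/a.html", "Tom & Jerry <b>")
escaped = format_result("/docs/a.html", "Tom & Jerry <b>", escape_html=True)
```

That way properties could be stored decoded (so they sort and compare
sensibly) and still be safe to print into HTML when the caller asks for it.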
Note that there's a hack in the current HTML parser that allows something
like:
<meta name="metaname" content="this is \<b\>bold\</b\>">
That, of course, would not work with the libxml parser. I don't think that
was ever documented, so it's probably not used much. (Roy, do you have
many indexes that use that escape method?)
Bill Moseley
mailto:moseley@hank.org
Received on Wed Sep 19 06:22:19 2001