Re: Re: HTML Entities

From: David Norris <dave(at)>
Date: Tue Nov 28 2000 - 10:52:07 GMT

> Probably, it is working this way for historical reasons.

I think you are exactly right.

> Should it be changed?

Eventually, I think the ideal solution is to internally use Unicode/UTF
with a real HTML/XML parser.  I think that would completely solve
character issues.  That's very easy for me to say, though ;-)  Complex
solution, I'm afraid.

It occurred to me that using a recode filter may be possible as a
short-term hack.  Recode can translate a document to/from HTML entities.  That
would give you consistent entries in the index.  Then just make sure
that searching works as expected.  Numerical entities may cause

Example (using latin-1):

  recode -d latin-1:html

This treats the input as latin-1 text but -d limits the output
conversion to "diacritic" characters.  This would prevent the HTML
markup from being converted.
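
For anyone without recode at hand, the same idea can be sketched in a
few lines of Python.  This is only an illustration of "encode the
accented characters, leave the markup alone", not a description of how
recode itself works; the function name encode_diacritics is made up for
the example, and it assumes the latin-1 input has already been decoded
into a Python string.

```python
from html.entities import codepoint2name


def encode_diacritics(text: str) -> str:
    """Replace non-ASCII characters with HTML entities, leaving markup alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII passes through untouched, including <, >, & and "
            out.append(ch)
        elif cp in codepoint2name:
            # Use the named entity when HTML defines one (e.g. eacute)
            out.append(f"&{codepoint2name[cp]};")
        else:
            # Otherwise fall back to a decimal numeric entity
            out.append(f"&#{cp};")
    return "".join(out)


print(encode_diacritics("<b>café</b>"))
```

Because the tags are plain ASCII they come out unchanged, which is the
property that matters for indexing HTML documents.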

  recode -d html:latin-1

This would do the reverse: convert HTML entities into latin-1.  This
may be better (correct?).

