Re: Re: HTML Entities

From: David Norris <dave(at)not-real.webaugur.com>
Date: Tue Nov 28 2000 - 10:52:07 GMT
jmruiz@boe.es wrote:
> Probably, it is working this way for historical reasons.

I think you are exactly right.

> Should it be changed?

Eventually, I think the ideal solution is to internally use Unicode/UTF
with a real HTML/XML parser.  I think that would completely solve
character issues.  That's very easy for me to say, though ;-)  Complex
solution, I'm afraid.


It occurred to me that using a recode filter may be possible as a
short-term hack.  Recode can translate a document to/from HTML entities.
That would give you consistent entries in the index.  Then just make
sure that searching works as expected.  Numeric entities may cause
problems.
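
To illustrate why a single normalized form gives consistent index
entries, here is a minimal sketch in Python.  It does not use recode
and is not how the indexer works; it relies only on the standard
library's html.unescape, and the sample strings and the normalize()
helper are made up for the example.

```python
import html

# Three spellings of the same word, as they might appear in different
# documents: a named entity, a numeric entity, and the raw character.
variants = ["Espa&ntilde;a", "Espa&#241;a", "España"]

def normalize(text: str) -> str:
    """Decode named and numeric HTML entities to plain characters."""
    return html.unescape(text)

# After normalization all three collapse to a single index term.
normalized = {normalize(v) for v in variants}
print(normalized)  # {'España'}
```

Without such a step the three spellings would presumably end up as
three different words in the index.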

Example (using latin-1):

  recode -d latin-1:html

This treats the input as latin-1 text, but -d limits the output
conversion to "diacritic" characters.  That should keep the HTML
markup itself from being converted.
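
The same idea can be sketched in Python for anyone without recode at
hand.  This is only an approximation of the behaviour described above,
not a reimplementation of recode: the hypothetical encode_non_ascii()
helper leaves every ASCII character (including <, > and &) alone and
replaces everything else with a named entity where the standard
html.entities table has one, falling back to a numeric reference.

```python
from html.entities import codepoint2name

def encode_non_ascii(text: str) -> str:
    """Replace non-ASCII characters with HTML entities, leaving markup intact."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII, including the markup characters <, > and &: keep as-is.
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists, e.g. 241 -> "ntilde".
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No named entity: fall back to a decimal numeric reference.
            out.append(f"&#{cp};")
    return "".join(out)

print(encode_non_ascii("<p>Señor Müller</p>"))
# <p>Se&ntilde;or M&uuml;ller</p>
```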

  recode -d html:latin-1

This would do the reverse: convert HTML entities into latin-1.  This
may be the better (more correct?) direction.
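
A rough Python sketch of the reverse direction follows.  It is an
assumption about what one would want from such a filter rather than a
description of what recode does: entities are decoded, except those
that stand for markup characters (<, >, &, "), which are kept so the
document structure survives.  The names decode_entities() and
to_latin1() are hypothetical.

```python
import html
import re

# Characters that must stay escaped so the markup is not altered.
MARKUP_CHARS = {"<", ">", "&", '"'}

# Named, decimal, and hexadecimal character references.
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def decode_entities(text: str) -> str:
    """Decode HTML entities, except those representing markup characters."""
    def repl(match: re.Match) -> str:
        entity = match.group(0)
        decoded = html.unescape(entity)
        # Keep &lt;, &#60;, &amp; and friends exactly as written.
        return entity if decoded in MARKUP_CHARS else decoded
    return ENTITY.sub(repl, text)

def to_latin1(text: str) -> bytes:
    """Decode entities, then encode as latin-1.

    Characters outside latin-1 are written back as numeric references.
    """
    return decode_entities(text).encode("latin-1", errors="xmlcharrefreplace")

print(decode_entities("<p>Se&ntilde;or &amp; M&#252;ller</p>"))
# <p>Señor &amp; Müller</p>
```

Note that numeric entities are handled here in the same pass as named
ones, which is one place a simpler filter might go wrong.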

-- 
,David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Dave's Weather - http://www.webaugur.com/dave/wx
  ICQ Universal Internet Number - 412039
  E-Mail - dave@webaugur.com

"I would never belong to a club that would have me as a member!"
                                          - Groucho Marx
Received on Tue Nov 28 02:50:29 2000