Re: libxml2 and non-ascii?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Nov 22 2004 - 14:58:10 GMT
On Mon, Nov 22, 2004 at 04:07:13AM -0800, Roman Chyla wrote:
> thank you for the link - I played with configuration, but I am afraid
> the hints from FAQ can't solve my problem in Windows-1250, nor in
> Iso-8859-2 encoding when using libxml2 parser.
> 
> I tried also "TranslateCharacters" option, but since the UTF is 16 bit I
> can not map it to 8bit characters (did I miss something?)

UTF-8 is a variable-width encoding (one to four bytes per character),
not 16-bit, but yes, that's correct: iso-8859-1 is an 8-bit character
set and of course cannot represent all the chars that UTF-8 can.
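
To make the width point concrete, here is a small Python sketch. The
sample characters are my own picks for illustration, not taken from
Roman's documents, and the helper names are made up. It shows how many
bytes each character takes in UTF-8 and whether it fits in iso-8859-1
at all.

```python
# Illustrative sketch only; sample characters and helper names are assumptions.

def utf8_width(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def fits_latin1(ch):
    """True if the character has a code point in iso-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

samples = [
    "a",        # plain ASCII
    "\u00e9",   # e with acute, present in iso-8859-1
    "\u010d",   # c with caron, present in iso-8859-2 but not in iso-8859-1
    "\u20ac",   # euro sign, in neither of the two
]

for ch in samples:
    print(repr(ch), utf8_width(ch), fits_latin1(ch))
```

The c with caron is the interesting case: it is a single byte in
iso-8859-2, two bytes in UTF-8, and has no representation in
iso-8859-1.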

Since swish-e is 8-bit internally it has to convert to an 8-bit
encoding when reading from libxml2.  (Libxml2 outputs in UTF-8.)
Since libxml2 provides a function to convert to 8859-1, and that
encoding is what most users of swish-e have used in the past, that
encoding was used.

> perhaps, there could be a new TranslateCharactersUTF directive for users
> with libxml2 and non-8859-2 characters in docs?

I suppose that would be possible.  Currently when there's an encoding
error the character is replaced with a space -- but parser.c could be
hacked to check if the UTF-8 char should be mapped to another UTF-8
character before being encoded in 8859-1.
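
A rough sketch of that idea, in Python rather than C and not taken
from parser.c. The TRANSLATE table and the to_latin1 name are
hypothetical; a real directive would presumably fill the table from
the config file. Each character is first looked up in the table, then
encoded to 8859-1, and anything that still doesn't fit becomes a
space, which matches the current fallback described above.

```python
# Sketch under stated assumptions; not swish-e code.
# Hypothetical mapping table: characters missing from iso-8859-1
# mapped to a close stand-in that does exist there.
TRANSLATE = {
    "\u010d": "c",  # c with caron
    "\u0161": "s",  # s with caron
    "\u017e": "z",  # z with caron
    "\u0159": "r",  # r with caron
}

def to_latin1(text, table=TRANSLATE):
    """Map characters through `table`, then encode to iso-8859-1.

    Characters that still cannot be encoded are replaced by a space.
    """
    out = bytearray()
    for ch in text:
        ch = table.get(ch, ch)          # optional remapping step
        try:
            out += ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            out += b" "                 # fallback: replace with a space
    return bytes(out)

print(to_latin1("\u010desk\u00fd"))     # mapped c, then e, s, k, y-acute
print(to_latin1("\u0142"))              # l with stroke: not in the table
```

With an empty table this reduces to the space-replacement behavior
alone; the mapping only changes what happens to characters listed in
it.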

A rewrite for swish-e to use UTF-8 would be best, of course.  That's
not a new idea.

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Mon Nov 22 06:58:12 2004