Re: non-English characters in XML files

From: Bill Moseley <moseley(at)>
Date: Mon Nov 08 2004 - 05:09:31 GMT
[Sorry if this is a duplicate -- I thought I already responded but I
don't see it on the list..]

On Sun, Nov 07, 2004 at 12:59:56PM -0800, wrote:
> test.xml - Using XML2 parser 
> White-space found word 'Dise?r.'   <--- here too 
>   test.html - Using HTML2 parser - White-space found word 'diseņo' 
> White-space found word 'diseņar' 
> <?xml version="1.0" encoding="UTF-8"?> 

I think I asked already, but are you really indexing UTF-8?  Doesn't
look like it.  If you are telling libxml2 that you are indexing UTF-8
then it's not going to deal with ņ because, as stored in your file,
it's not a valid UTF-8 character.
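
One quick way to check is to try decoding the bytes as UTF-8.  The
sketch below is only an illustration and is not part of swish-e or
libxml2.  It assumes the word in your test files is the Spanish
"diseño" saved as ISO-8859-1, and is_valid_utf8 is a made-up helper
name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the indexed word is "diseño" (n with tilde is U+00F1).
word = "dise\u00f1o"

latin1_bytes = word.encode("iso-8859-1")  # one byte, 0xF1, for the n-tilde
utf8_bytes = word.encode("utf-8")         # two bytes, 0xC3 0xB1, for the n-tilde

print(is_valid_utf8(latin1_bytes))  # False: 0xF1 followed by "o" is not UTF-8
print(is_valid_utf8(utf8_bytes))    # True
```

Running the same check on the raw bytes of test.xml (opened in binary
mode) should show whether the file really matches its UTF-8
declaration.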

If you were to look at swish-e's source code in parser.c you would see
that the same code is used for processing XML and HTML.  So any
difference is due to the way libxml2 is processing the text.  It
appears that libxml2 is not using the correct encoding for your xml
file, but is using an encoding that works for your html file.

Maybe libxml2 assumes 8859-1 for the html.

Now, if you specify 8859-1 for the xml encoding and it still doesn't
work then maybe it has to do with the locale settings on your machine.
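
To illustrate why the declared encoding matters, here is a rough
sketch using Python's xml.etree.ElementTree.  Note that this uses the
expat parser, not libxml2, so it only demonstrates the general
principle and libxml2 may report the problem differently.  The <doc>
element and the parse_with_declared_encoding helper are made up for
the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical document body; 0xF1 is n-tilde in ISO-8859-1.
BODY = b"<doc>dise\xf1o</doc>"


def parse_with_declared_encoding(encoding: str) -> Optional[str]:
    """Parse BODY behind an XML declaration naming `encoding`.

    Returns the element text, or None if the parser rejects the input.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    try:
        return ET.fromstring(declaration + BODY).text
    except ET.ParseError:
        return None


# Declaration matches the bytes: the text comes back intact.
print(parse_with_declared_encoding("ISO-8859-1"))

# Declaration says UTF-8 but the bytes are ISO-8859-1: the parse fails.
print(parse_with_declared_encoding("UTF-8"))
```

The same bytes parse cleanly or not depending only on the encoding
named in the declaration, which is why it is worth making sure the
declaration matches how the file was actually saved.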

Bill Moseley

