Re: non-English characters in XML files

From: Bill Moseley <moseley(at)>
Date: Tue Nov 09 2004 - 18:44:44 GMT
On Tue, Nov 09, 2004 at 10:19:05AM -0800, wrote:
> > On Tue, Nov 09, 2004 at 06:52:24AM -0800, wrote:
> > > Swish-e splits the words in ISO-8859. I like the way that works with
> > > the UTF-8.
> > 
> > So I guess that means your source xml is encoded in UTF-8.
>   Yes, but I noticed that my server has files encoded in UTF-8 and
> others in ISO-8859, so I'll have files with 's indexed as n and
> others with the words split. Has anyone had this problem with XML
> files? How do you resolve it and index your XML files? I don't know
> what to do.

You might review ("How is it
implemented?" section).  This part seems to be related to your problem:

    If there is no encoding declaration, then the input has to be in
    either UTF-8 or UTF-16, if it is not then at some point when
    processing the input, the converter/checker of UTF-8 form will
    raise an encoding error. You may end-up with a garbled document,
    or no document at all !

You may need to make sure your XML is well-formed and has the encoding
specified in its XML declaration.  You might be able to automate that
process (maybe the file(1) command can help figure out the encoding).
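
As a rough illustration of automating that, here is a minimal Python
sketch.  It is not part of Swish-e or libxml2; the function names are
made up for this example.  It assumes every file is either UTF-8 or
ISO-8859-1: if the bytes decode cleanly as UTF-8 it declares UTF-8,
otherwise it falls back to ISO-8859-1.  That fallback is a guess (any
byte sequence "decodes" as ISO-8859-1), and UTF-16 input is not handled.

```python
import re

# An XML declaration is only valid at the very start of the document.
XML_DECL = re.compile(rb'<\?xml\s[^>]*\?>')
ENCODING_ATTR = re.compile(rb'encoding\s*=\s*["\'][A-Za-z0-9._-]+["\']')
VERSION_ATTR = re.compile(rb'(version\s*=\s*(["\'])[^"\']*\2)')
UTF8_BOM = b'\xef\xbb\xbf'


def guess_encoding(data: bytes) -> str:
    """Return 'UTF-8' if data is valid UTF-8, else assume ISO-8859-1."""
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        return "ISO-8859-1"


def ensure_encoding_declaration(data: bytes) -> bytes:
    """Return data with an encoding declared in its XML declaration."""
    # Drop a UTF-8 byte order mark so the declaration can come first.
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]

    decl = XML_DECL.match(data)
    if decl and ENCODING_ATTR.search(decl.group(0)):
        return data  # encoding already declared; leave the file alone

    enc = guess_encoding(data).encode("ascii")
    if decl:
        # Declaration exists without an encoding: add it after version.
        fixed = VERSION_ATTR.sub(
            b'\\1 encoding="' + enc + b'"', decl.group(0), count=1
        )
        return fixed + data[decl.end():]

    # No declaration at all: prepend one.
    return b'<?xml version="1.0" encoding="' + enc + b'"?>\n' + data
```

You would read each file in binary mode, pass the bytes through
ensure_encoding_declaration(), and write the result back (to a copy
first, ideally) before indexing.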

Bill Moseley

Received on Tue Nov 9 10:44:45 2004