On Tue, Nov 09, 2004 at 10:19:05AM -0800, dasoso@alumni.uv.es wrote:
>
>
> > On Tue, Nov 09, 2004 at 06:52:24AM -0800, dasoso@alumni.uv.es wrote:
> > > Swish-e splits the words in ISO-8859. I like the way that works with
> > > the UTF-8.
> >
> > So I guess that means your source xml is encoded in UTF-8.
>
>
> Yes, but I noticed that my server has files encoded in UTF-8 and
> others in ISO-8859, so I'll have files with ñ's indexed as n and
> others with the words split. Does anyone else have this problem with
> XML files? How do you resolve it and index your XML files? I don't
> know what to do.
You might review http://xmlsoft.org/encoding.html ("How is it
implemented?" section). This part seems to be related to this
discussion.

    If there is no encoding declaration, then the input has to be in
    either UTF-8 or UTF-16, if it is not then at some point when
    processing the input, the converter/checker of UTF-8 form will
    raise an encoding error. You may end-up with a garbled document,
    or no document at all !

You may need to make sure your XML is well-formed and declares its
encoding. You might be able to automate that process (the file(1)
command may help figure out the encoding).
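
As a rough illustration of automating that, here is a minimal Python
sketch. It is not anything Swish-e or libxml2 provides: the function
name normalize_xml is made up for this example, and the fallback to
ISO-8859-1 is an assumption based on the two encodings mentioned in
this thread. Any byte sequence decodes as ISO-8859-1, so the fallback
is only a guess, and UTF-16 input is not handled at all.

```python
import re

# An XML declaration must sit at the very start of the document.
XML_DECL = re.compile(rb'^<\?xml[^>]*\?>')
HAS_ENCODING = re.compile(rb'encoding\s*=\s*["\'][A-Za-z][A-Za-z0-9._-]*["\']')


def normalize_xml(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return the document as UTF-8 with an explicit encoding declaration.

    Hypothetical helper: files that already declare an encoding are
    returned untouched; everything else is decoded as UTF-8 if possible,
    otherwise as `fallback` (an assumption, not a detection).
    """
    decl = XML_DECL.match(data)
    if decl and HAS_ENCODING.search(decl.group(0)):
        return data  # the parser can already tell how to read this file

    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode(fallback)  # guess: not valid UTF-8, so assume Latin-1

    # Drop a declaration that lacks an encoding, then write a complete one.
    text = re.sub(r'^<\?xml[^>]*\?>\s*', '', text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + text).encode("utf-8")
```

You would run each file's bytes through something like this before
indexing, so every document reaches the parser as UTF-8 with the
encoding stated. Checking a few converted files by eye is still a good
idea, since the fallback cannot tell ISO-8859-1 from other 8-bit
encodings.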
--
Bill Moseley
moseley@hank.org
Received on Tue Nov 9 10:44:45 2004