Re: Encoding problems

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jul 10 2002 - 13:27:03 GMT

At 05:38 AM 07/10/02 -0700, Kristaps Erglis wrote:
>I'm using SWISH-E to index HTML files with some XML like <properties> for
>files with cp-1257 (Baltic) character set.
>
>It's no problem if settings are
>
>'IndexContents HTML .htm .html'
>
>but then 
>
>PropertyNames doesn't work.

I don't know much about character encodings -- and most of swish only knows
about 8-bit chars.  Without an example from you, what follows is only a
guess.

That might be due to the way the C library is converting to lower case, and
the way it converts to lower case should depend on the locale setting.  Is
it possible that the property names in your docs are being converted to
lower case and then not matching the property settings in your config?
Using -T indexed_words should give you an idea of how words are being
converted to lower case.
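
As a rough illustration of that guess (Python, not swish code): lower-casing
raw bytes with ASCII-only rules, which is roughly what tolower() does in the
"C" locale, leaves a non-ASCII letter untouched, while a charset-aware
lower-casing changes it.  The letter "Ā" is only an assumed example of a
cp-1257 character; the real property names were not posted.

```python
# Illustration only: "Ā" stands in for any non-ASCII letter in a cp-1257 doc.
raw = "Ā".encode("cp1257")

# bytes.lower() only touches ASCII A-Z, much like tolower() in the "C" locale.
ascii_lowered = raw.lower()

# Decoding first lets the lower-casing rule for the letter itself apply.
charset_lowered = raw.decode("cp1257").lower().encode("cp1257")

print(ascii_lowered == raw)    # ASCII-only rules: byte left as it was
print(charset_lowered == raw)  # charset-aware rules: byte changed
```

If the indexer and the config end up on different sides of that difference,
the two spellings of a name would no longer compare equal.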

>When I change to 
>
>'IndexContents HTML2 .htm .html'
>
>libxml2 (libxml2.so.2.4.23)
>
>Properties are OK but all words containing special letters with diacritical
>symbols "just dies".

I fear "just dies" is not descriptive enough.  As you found out from using
the ParserWarnLevel:

>Failed to convert internal UTF-8 to Latin-1.
>Replacing non ISO-8859-1 char with char ' '

Libxml2 correctly understands encodings and converts everything to UTF-8
internally.  

Swish, on the other hand, only knows 8-bit chars.  Libxml2 includes a
utility function called UTF8Toisolat1() which is used in swish to map chars
back to 8-bit.  That will probably break docs that are not 8859-1 encoded.

Chars it cannot convert are indexed as a space -- and that will cause swish
to split that word into two (or more) "words" since a space is not a
wordcharacter.
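
A minimal sketch of that effect, in Python rather than swish's C, and only
mimicking the behaviour described above: any character outside ISO-8859-1
becomes a space, and splitting on whitespace stands in for swish's real word
splitting.  The function names and the sample word are made up for the
example.

```python
def to_latin1_lossy(text):
    """Replace every character outside ISO-8859-1 with a space."""
    return "".join(ch if ord(ch) <= 0xFF else " " for ch in text)


def split_words(text):
    """Stand-in for word splitting: break on whitespace only."""
    return text.split()


# "ī" (U+012B) is not in ISO-8859-1, so the word falls apart.
print(split_words(to_latin1_lossy("Rīga")))

# "é" (U+00E9) is in ISO-8859-1, so this word survives intact.
print(split_words(to_latin1_lossy("café")))
```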

If anyone has a better way to map from UTF-8 to an 8-bit character set
defined in the config file, please speak up.  We need a function a bit more
useful than UTF8Toisolat1().
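
To make the idea concrete, here is a sketch in Python of what such a mapping
could look like.  It is not swish code and not a libxml2 API: the function
name, the fallback byte, and the idea of passing the charset name in from the
config are all assumptions.  It also assumes the target is a single-byte
charset.

```python
def utf8_to_8bit(data, charset, fallback=b" "):
    """Convert UTF-8 bytes to a single-byte charset, one character at a time.

    Characters the target charset cannot represent become `fallback`.
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out += fallback
    return bytes(out)


utf8_doc = "Rīga".encode("utf-8")

# With the Baltic charset the word stays in one piece.
print(utf8_to_8bit(utf8_doc, "cp1257").decode("cp1257"))

# With Latin-1 as the target, "ī" is replaced by the fallback space.
print(utf8_to_8bit(utf8_doc, "latin-1"))
```

The point of the sketch is only that the target charset is a parameter
instead of being fixed to ISO-8859-1.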

The correct solution would be to rewrite swish to work with UTF-8.  I can't
see that happening anytime soon.

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Jul 10 13:30:34 2002