Re: 8-bit chars

From: John Angel <angel_john(at)>
Date: Sat Dec 06 2003 - 11:21:19 GMT
Hi Bill,

A specific example: try to index a word containing character 154 (0x9A, which is above the 7-bit ASCII range).

I have added it to WordCharacters, but I still get no results.

I assume the problem is the conversion to Latin1; as you said, it is not 100%
8-bit clean. Is there some other function we could use to translate UTF-8 to
8-bit chars instead of UTF8Toisolat1()? Or, even better, could we avoid the
conversion to UTF-8 entirely and leave every char as it is in the original?
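
To illustrate why I suspect the Latin1 step, here is a minimal sketch of a UTF-8 to ISO-8859-1 conversion. It is not libxml2's UTF8Toisolat1() and not Swish-e code; the name utf8_to_latin1_sketch is made up for this example. The assumption is that the source document is in a Windows code page, where byte 154 (0x9A) is "š" (U+0161). Once the parser has turned that into UTF-8, the code point is above U+00FF and has no Latin1 equivalent, so any such conversion must fail or drop it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only (not libxml2, not Swish-e).
 * Converts a NUL-terminated UTF-8 string to ISO-8859-1.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * input holds a code point above U+00FF, a malformed sequence, or does
 * not fit in the output buffer. */
static int utf8_to_latin1_sketch(const unsigned char *in,
                                 unsigned char *out, size_t outlen)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outlen == 0)
        return -1;

    while (*in != 0) {
        unsigned int cp;

        if (*in < 0x80) {
            /* Plain 7-bit ASCII maps to itself. */
            cp = *in++;
        } else if ((*in == 0xC2 || *in == 0xC3) && (in[1] & 0xC0) == 0x80) {
            /* Lead bytes C2 and C3 cover exactly U+0080..U+00FF. */
            cp = ((unsigned int)(*in & 0x1F) << 6) | (in[1] & 0x3F);
            in += 2;
        } else {
            /* Anything else is outside Latin1 (or malformed). */
            return -1;
        }

        if (n + 1 >= outlen)
            return -1;          /* keep room for the terminating NUL */
        out[n++] = (unsigned char)cp;
    }
    out[n] = 0;
    return (int)n;
}
```

With this sketch, the UTF-8 bytes C2 9A (the Latin1 control character U+009A) convert to the single byte 0x9A, while C5 A1 ("š", U+0161) is rejected. If that is what happens to my input, the character in WordCharacters never matches anything in the index.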


> > I have added chars above ASCII 127 to WordCharacters but I still get
> > blanks instead of them. Where's the catch?
>You need to give an example of what's not working.
> > BTW, I have noticed that in WordCharacters there are only small caps 
>Yes, words are lowercased with "tolower()" as you noticed.  So only
>lower case need to be specified.
> > UTF-8 support would be great, but I understand it requires a major
> > rewrite. Is it possible to have at least full 8-bit chars support instead?
>It is full 8-bit, but there's a conversion to Latin1 when using libxml2
>so it may not be 100% 8-bit "clean".  I have not tested that with
>BTW - First thing swish-e does when starting is:
>       setlocale(LC_CTYPE, "");
>but that's only in the binary.  (So that might result in problems when
>people use the Swish-e API on systems with different locales -- that is,
>tolower() might not change umlauts on indexing but would on searching.)
> > Searching through previous posts shows that the problem could be in
> > UTF8Toisolat1() and tolower() functions, but I am not sure how to
> > fix that.
>Can you provide a specific example of the problem?
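
On the setlocale()/tolower() point quoted above, a small sketch may show the effect. The helper lower_in_locale is hypothetical and not part of Swish-e; it only uses the standard C setlocale() and tolower() calls. In the default "C" locale, tolower() changes only A-Z, so a byte such as 0xC4 (an uppercase A-umlaut in Latin1) comes back unchanged. Under a Latin1 locale it may be lowercased, but which locale names are installed depends on the system, so that part is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper for illustration only (not Swish-e code).
 * Lowercases one byte under the named LC_CTYPE locale, then switches
 * back to the "C" locale. Returns -1 if the locale is not available. */
static int lower_in_locale(const char *locale_name, unsigned char c)
{
    int result;

    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;              /* locale not installed on this system */

    result = tolower(c);        /* argument is an unsigned char value */
    setlocale(LC_CTYPE, "C");
    return result;
}
```

Calling lower_in_locale("C", 0xC4) returns 0xC4, while a program that first calls setlocale(LC_CTYPE, "") under a Latin1 environment could get a different byte. An index built one way and searched the other way would then disagree on such characters.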

Received on Sat Dec 6 11:21:25 2003