Hi Bill,
A specific example: try to index a word containing character 154 (0x9A).
I have added it to WordCharacters, but searches return no results.
I assume the problem is the conversion to Latin1 - as you said, it is not 100%
8-bit clean. Is there some other function we could use to translate UTF-8 to
8-bit chars instead of UTF8Toisolat1()? Or, even better, is there a way to
avoid the conversion to UTF-8 completely and leave every char as it is in the
original?
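
One possible explanation, offered as a sketch rather than a diagnosis: if the 154 byte comes from a document in Windows-1250 or Windows-1252 (an assumption; the encoding is not stated here), it stands for U+0161, which has no ISO-8859-1 equivalent, so any UTF-8 to Latin1 step has to drop or reject it. The function below is a hypothetical stand-in written for illustration. It is not the libxml2 implementation of UTF8Toisolat1(), and it does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of swish-e or libxml2): convert UTF-8 to
 * ISO-8859-1. Returns the number of bytes written, or -1 if the input is
 * malformed, the output buffer is too small, or a character has no Latin-1
 * equivalent (code point above U+00FF).
 */
int utf8_to_latin1_sketch(const unsigned char *in, size_t inlen,
                          unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < inlen) {
        unsigned int cp;

        if (in[i] < 0x80) {
            /* One-byte sequence: plain ASCII. */
            cp = in[i++];
        } else if ((in[i] & 0xE0) == 0xC0 && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence: covers U+0080..U+07FF. */
            cp = ((in[i] & 0x1Fu) << 6) | (in[i + 1] & 0x3Fu);
            i += 2;
        } else {
            /* Longer sequences are always above U+00FF; the rest is malformed. */
            return -1;
        }

        /* Latin-1 can only hold code points U+0000..U+00FF. */
        if (cp > 0xFF || o >= outcap)
            return -1;
        out[o++] = (unsigned char)cp;
    }
    return (int)o;
}
```

With this sketch, the UTF-8 bytes 0xC3 0xA9 (U+00E9) convert to the single byte 0xE9, while 0xC5 0xA1 (U+0161) is rejected. That would match a word silently losing the character during indexing.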
Regards,
John
> > I have added chars above ASCII 127 to WordCharacters but it still displays
> > blanks instead of them. Where's the catch?
>
>You need to give an example of what's not working.
>
> > BTW, I have noticed that in WordCharacters there are only lowercase chars.
>
>Yes, words are lowercased with "tolower()" as you noticed. So only
>lowercase characters need to be specified.
>
> > UTF-8 support would be great, but I understand it requires a major rewrite.
> > Is it possible to have at least full 8-bit chars support instead?
>
>It is full 8-bit, but there's a conversion to Latin1 when using libxml2
>so it may not be 100% 8-bit "clean". I have not tested that with
>libxml2.
>
>BTW - First thing swish-e does when starting is:
>
> setlocale(LC_CTYPE, "");
>
>but that's only in the binary. (So that might result in problems when
>people use the Swish-e API on systems with different locales -- that is,
>tolower() might not change umlauts on indexing but would on searching.)
>
> > Searching through previous posts shows that the problem could be in
> > UTF8Toisolat1() and tolower() functions, but I am not sure how to change
> > and fix that.
>
>Can you provide a specific example of the problem?
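
The locale point quoted above can be illustrated with a small sketch. The helper names are hypothetical and are not taken from the swish-e source. The idea is that tolower() only lowercases bytes above 127 when the active LC_CTYPE locale defines a mapping for them, so a program that never calls setlocale() stays in the "C" locale and leaves those bytes untouched.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: lowercase a byte string in place with tolower(),
 * using whatever LC_CTYPE locale is currently active.
 */
void lowercase_bytes(char *s)
{
    size_t i;

    assert(s != NULL);

    for (i = 0; s[i] != '\0'; i++) {
        /* Cast to unsigned char: a negative char passed to tolower() is undefined. */
        s[i] = (char)tolower((unsigned char)s[i]);
    }
}

/*
 * Select the locale from the environment, as the swish-e binary is
 * described as doing at startup. Returns 1 on success, 0 on failure.
 */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

In the "C" locale, lowercase_bytes() changes only A-Z, so a byte such as 0xC4 is left as it is. After use_environment_locale() succeeds under an ISO-8859-1 locale, the same byte would normally become 0xE4. An index built one way and searched the other way could therefore disagree on words containing such bytes.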
Received on Sat Dec 6 11:21:25 2003