Re: [swish-e] Case matching with ISO-8859-9

From: Peter Karman <peter(at)>
Date: Wed May 13 2009 - 04:22:31 GMT
Bill Moseley wrote on 5/12/09 9:38 PM:
> On Tue, May 12, 2009 at 07:05:01PM +0300, Fatih Aytaç wrote:
>> I am indexing iso-8859-9 encoded html files. Swish-e can make searches with
>> iso-8859-9 special chars. But cannot match the lowercase letters with the
>> uppercase ones.
>> The search of "ALL","all","All" words gives me the same results. But search
>> of "ALİ","ali" words gives different result. The lowercase of "İ" is "i" in
>> Turkish. How can I make correct lowercase/uppercase match of Turkish
>> characters so that swish-e gives the same results for the words "ALİ" and
>> "ali".
> Swish uses tolower(), IIRC, which should respect locale settings.
> Have you tried setting your locale?  I suspect you would want to do
> that when indexing.

Bill is correct. The function is strtolower() in swstring.c, which calls
tolower() on each byte. Here's a small example program to show the effect of
the locale:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <locale.h>

/* strtolower() from swstring.c in swish-e 2.4.x */
char *
strtolower(char *s)
{
    unsigned char *p = (unsigned char *) s;

    while (*p) {
        *p = tolower(*p);
        p++;
    }
    return s;
}

int
main( int argc, char *argv[] )
{
    int i;
    char *str;
    char *loc;

    /* pick up LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG) */
    loc = setlocale(LC_CTYPE, "");
    printf("locale = %s\n", loc ? loc : "(unavailable)");

    for(i=1; i<argc; i++) {
        str = argv[i];
        printf("%s -> ", str);
        printf("%s\n", strtolower(str));
    }

    return 0;
}

[karpet@ira:~/tmp]$  LC_ALL=tr_TR.ISO8859-9 ./strtolower ÏIAÀÁÂ
locale = tr_TR.ISO8859-9
ÏIAÀÁÂ -> ïıaàáâ

[karpet@ira:~/tmp]$  LC_ALL=en_US.ISO8859-1 ./strtolower ÏIAÀÁÂ
locale = en_US.ISO8859-1
ÏIAÀÁÂ -> ïiaàáâ
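
To make the difference between the two runs concrete, here is a minimal sketch
of the same lowercasing done by hand for ISO-8859-9 bytes, with no installed
locale required. The function names are hypothetical and not part of swish-e.
The mapping assumes Turkish casing rules: "I" lowercases to dotless "ı" (0xFD)
and "İ" (0xDD) lowercases to plain "i".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of swish-e: lowercase a single
 * ISO-8859-9 byte using Turkish casing rules. */
static unsigned char
lower_byte_iso8859_9_tr(unsigned char c)
{
    if (c == 'I')
        return 0xFD;                        /* I -> dotless i */
    if (c == 0xDD)
        return 'i';                         /* I with dot above -> i */
    if (c >= 'A' && c <= 'Z')
        return (unsigned char) (c + 0x20);  /* remaining ASCII letters */
    /* 0xC0..0xDE are uppercase letters, except 0xD7 (multiplication sign) */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char) (c + 0x20);
    return c;
}

/* Hypothetical in-place variant of strtolower() that does not
 * depend on the process locale. */
char *
strtolower_iso8859_9_tr(char *s)
{
    unsigned char *p;

    assert(s != NULL);
    for (p = (unsigned char *) s; *p; p++)
        *p = lower_byte_iso8859_9_tr(*p);
    return s;
}
```

With this mapping "ALİ" becomes "ali", while "ALI" becomes "alı", which is the
same split the tr_TR run above shows for "I". Swish-e itself relies on
tolower() and the locale, so this is only an illustration of what the locale
is doing.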

Note that setting LC_CTYPE is recommended over LC_ALL [0].
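
As a rough sketch of that advice inside a program: setlocale() can be asked to
change only the LC_CTYPE category, which leaves number formatting, collation
and the other categories alone. set_ctype_only() is a hypothetical helper
name, and whether a given locale name such as tr_TR.ISO8859-9 is installed on
your system is an assumption you would need to check.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: change only the character-classification
 * category. Returns 1 on success, 0 if setlocale() rejects the name
 * (it then returns NULL and leaves the current locale unchanged). */
int
set_ctype_only(const char *name)
{
    assert(name != NULL);
    return setlocale(LC_CTYPE, name) != NULL;
}
```

A caller would check the return value, e.g. set_ctype_only("tr_TR.ISO8859-9"),
before indexing. From the shell, keep in mind that an exported LC_ALL
overrides LC_CTYPE, so LC_ALL has to be unset for LC_CTYPE to take effect.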


Peter Karman
