Re: [swish-e] Case matching with ISO-8859-9

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed May 13 2009 - 04:22:31 GMT
Bill Moseley wrote on 5/12/09 9:38 PM:
> On Tue, May 12, 2009 at 07:05:01PM +0300, Fatih Aytaç wrote:
>> I am indexing iso-8859-9 encoded html files. Swish-e can make searches with
>> iso-8859-9 special chars. But cannot match the lowercase letters with the
>> uppercase ones.
>> The search of "ALL","all","All" words gives me the same results. But search
>> of "ALİ","ali" words gives different result. The lowercase of "İ" is "i" in
>> Turkish. How can I get correct lowercase/uppercase matching of Turkish
>> characters so that swish-e gives the same results for the words "ALİ" and
>> "ali".
> 
> Swish uses tolower(), IIRC, which should respect locale settings.
> Have you tried setting your locale?  I suspect you would want to do
> that when indexing.
> 

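Bill is correct. strtolower() in swstring.c is the function that does the
case folding, and it calls tolower() on each byte, so the result depends on
the LC_CTYPE locale in effect. Here's a small example program to show it: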

---------------------snip----------------------------
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <ctype.h>   /* tolower() */

/* strtolower() from swstring.c in swish-e 2.4.x */
char *
strtolower(char *s)
{
    unsigned char *p = (unsigned char *) s;

    while (*p)
    {
        *p = tolower((unsigned char) *p);
        p++;
    }
    return s;
}


int
main( int argc, char *argv[] )
{
    int i;
    char *str;
    char *loc;

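    /* "" means: take the LC_CTYPE locale from the environment */
    loc = setlocale(LC_CTYPE, "");
    /* setlocale() returns NULL if the requested locale is not available */
    printf("locale = %s\n", loc ? loc : "(setlocale failed)");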

    for(i=1; i<argc; i++) {
        str = argv[i];
        printf("%s -> ", str);
        printf("%s\n", strtolower(str));
    }

    return 0;
}

---------------------snip----------------------------

[karpet@ira:~/tmp]$  LC_ALL=tr_TR.ISO8859-9 ./strtolower ÏIAÀÁÂ
locale = tr_TR.ISO8859-9
ÏIAÀÁÂ -> ïıaàáâ

[karpet@ira:~/tmp]$  LC_ALL=en_US.ISO8859-1 ./strtolower ÏIAÀÁÂ
locale = en_US.ISO8859-1
ÏIAÀÁÂ -> ïiaàáâ
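
Terminal encoding can make output like the above hard to read, so it may
help to look at the raw bytes instead. The following is only a sketch:
fold_byte() and show_fold() are hypothetical helpers, not part of swish-e,
and the locale name you pass in has to be one that is installed on your
system.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of swish-e: lowercase one byte under the
 * named LC_CTYPE locale, then restore the previous locale.
 * Returns 0 on success, -1 if the locale cannot be set. */
int
fold_byte(const char *locale_name, unsigned char in, unsigned char *out)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);

    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return -1;
    strcpy(saved, cur);   /* a later setlocale() may overwrite cur */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;        /* locale not available */

    *out = (unsigned char) tolower(in);
    setlocale(LC_CTYPE, saved);
    return 0;
}

/* Print the mapping for one byte in hex, or a note if the locale is missing. */
void
show_fold(const char *locale_name, unsigned char in)
{
    unsigned char out;

    if (fold_byte(locale_name, in, &out) == 0)
        printf("%s: 0x%02X -> 0x%02X\n", locale_name, in, out);
    else
        printf("%s: locale not available\n", locale_name);
}
```

In ISO-8859-9, "İ" is byte 0xDD and "ı" is byte 0xFD. With working Turkish
locale data I would expect show_fold("tr_TR.ISO8859-9", 0xDD) to report 0x69
and show_fold("tr_TR.ISO8859-9", 'I') to report 0xFD, which would match the
I -> ı seen in the output above, but that depends on your platform's locale
tables, so treat it as something to verify rather than a guarantee.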


Note that setting LC_CTYPE alone is recommended over setting LC_ALL [0].
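
Inside a program the same idea looks roughly like this. It is a minimal
sketch, and use_env_ctype_only() and report_categories() are names made up
for illustration. Only the character-classification category is taken from
the environment; the other categories (LC_NUMERIC is shown here) stay in the
default "C" locale.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: take only LC_CTYPE from the environment and leave
 * every other category untouched. Returns the locale name, or NULL if the
 * environment names a locale that is not available. */
const char *
use_env_ctype_only(void)
{
    /* "" consults LC_ALL first, then LC_CTYPE, then LANG */
    return setlocale(LC_CTYPE, "");
}

/* Show which locale two of the categories ended up in. Each query result is
 * printed right away because a later setlocale() call may overwrite it. */
void
report_categories(void)
{
    const char *name;

    name = setlocale(LC_CTYPE, NULL);
    printf("LC_CTYPE   = %s\n", name ? name : "(unknown)");

    name = setlocale(LC_NUMERIC, NULL);
    printf("LC_NUMERIC = %s\n", name ? name : "(unknown)");
}
```

Calling use_env_ctype_only() and then report_categories() from main() shows
the two categories side by side. Keep in mind that if LC_ALL is set in the
environment it still overrides LC_CTYPE, so it has to be unset for an
LC_CTYPE setting to take effect.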


[0] http://mail.nl.linux.org/linux-utf8/2001-09/msg00030.html

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Wed May 13 00:23:34 2009