Re: problems with tolower

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 16 2005 - 16:16:29 GMT
On Wed, Feb 16, 2005 at 06:46:41AM -0800, dasoso@alumni.uv.es wrote:
>  1.-Here are my locale settings. Could they be the reason swish-e
> indexes ÁRBOL as Árbol?

Yes, that's what I was suggesting.

Swish-e is converting your text to 8859-1 encoding, but you are
telling it to sort using UTF-8.

Run swish like this:

    LANG=es_ES swish-e -c config

Maybe a demonstration will make it clear:

moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -i t.txt -T indexed_words -v0
    Adding:[1:swishdefault(1)]   'pestaÑa'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'Águila'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )
moseley@bumby:~$ LANG=es_ES swish-e -i t.txt -T indexed_words -v0
    Adding:[1:swishdefault(1)]   'pestaña'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'águila'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )

And you will need to search that way, too. Or at least make sure
your locale setting is the same when indexing and when searching, so
that tolower() operates the same way in both cases.  But the bottom
line is you don't want to tell tolower() that it's working with UTF-8
encoding when it's really working with 8859-1 encoding.


moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -w PESTAÑA -H9 | grep Parsed
# Parsed Words: pestaÑa 
moseley@bumby:~$ LANG=es_ES swish-e -w PESTAÑA -H9 | grep Parsed
# Parsed Words: pestaña
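
For anyone who wants to see why the two runs differ, here is a rough
simulation in Python. It is not swish-e code, and the function names are
made up for illustration. It assumes that a byte-wise tolower() in a UTF-8
locale leaves bytes above 0x7F alone (a lone high byte is not a complete
UTF-8 character), while in an 8859-1 locale it knows that 0xD1 (Ñ) maps
to 0xF1 (ñ).

```python
# Simulation only; the real work happens in C's tolower() inside swish-e.

def lower_ascii_only(raw: bytes) -> bytes:
    """Lowercase A-Z only; bytes above 0x7F pass through unchanged.

    Stands in for tolower() applied byte by byte under a UTF-8 locale
    (assumed behaviour, see the note above).
    """
    return raw.lower()  # bytes.lower() only touches ASCII letters


def lower_latin1(raw: bytes) -> bytes:
    """Lowercase using ISO-8859-1 case rules, one byte per character.

    Stands in for tolower() under an 8859-1 locale such as es_ES.
    """
    return raw.decode("latin-1").lower().encode("latin-1")


for word in ("PESTAÑA", "Águila"):
    raw = word.encode("latin-1")  # the bytes swish-e works with internally
    as_utf8_locale = lower_ascii_only(raw).decode("latin-1")
    as_latin1_locale = lower_latin1(raw).decode("latin-1")
    print(f"{word}: UTF-8 locale -> {as_utf8_locale}, "
          f"8859-1 locale -> {as_latin1_locale}")
```

Under those assumptions PESTAÑA comes out as pestaÑa in the first case and
pestaña in the second, which matches what the runs above show.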


We could force LANG at program startup, but there's more than one
valid setting (e.g. en_US, de_DE, es_ES), so we want people to be able
to set that themselves.
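
If you drive swish-e from a script, one way to keep the locale consistent
is to build the environment in a single place and reuse it for both
indexing and searching. A minimal sketch, assuming swish-e picks up its
locale from the usual environment variables, where LC_ALL and LC_CTYPE
take precedence over LANG. The helper name swish_env is hypothetical, and
es_ES is just the setting used in this thread.

```python
import os


def swish_env(lang: str = "es_ES") -> dict:
    """Return a copy of the current environment with the locale pinned.

    LC_ALL and LC_CTYPE are removed because they would override LANG
    for character classification.
    """
    env = dict(os.environ)
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    env["LANG"] = lang
    return env


# Usage (commented out so the sketch runs without swish-e installed):
# import subprocess
# subprocess.run(["swish-e", "-c", "config"], env=swish_env(), check=True)
# subprocess.run(["swish-e", "-w", "PESTAÑA", "-H9"], env=swish_env(), check=True)
```

Using the same helper for both calls means tolower() sees the same locale
at index time and at search time, whichever valid setting you choose.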




-- 
Bill Moseley
moseley@hank.org

Received on Wed Feb 16 08:16:30 2005