problems with tolower continues again.

From: <dasoso(at)not-real.alumni.uv.es>
Date: Wed Feb 23 2005 - 19:41:00 GMT
   
   
   Hi all.    
  
    I must have something misconfigured on my system regarding the
character set, because your examples show that swish-e has no
problem with ñ's etc.
   
   
swish-e-2.4.3> cat test.xml   
   
españa  PESTAÑA   
niño    NIÑO   
émbolo ÉMBOLO   
   
   
swish-e-2.4.3> LANG=es_ES swish-e -i test.xml -T indexed_words -v0   
   
Adding:[1:swishdefault(1)]   'espa?   Pos:5  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'a'   Pos:6  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'pesta?   Pos:7  Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)]   'a'   Pos:8  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'ni?   Pos:9  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'o'   Pos:10  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'ni?   Pos:11  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'o'   Pos:12  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   '?   Pos:13  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'mbolo'   Pos:14  Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)]   '?   Pos:15  Stuct:0x9 ( BODY FILE )   
Adding:[1:swishdefault(1)]   'mbolo'   Pos:16  Stuct:0x9 ( BODY FILE )
   
   
swish-e-2.4.3> swish-e -k '*'   
   
# SWISH format: 2.4.3   
index.swish-e: a espa?mbolo ni?o pesta??   
   
   
David.   
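
One way to narrow this down is to check which encoding the bytes of the input file actually use, since an ñ is one byte (0xF1) in ISO-8859-1 but two bytes (0xC3 0xB1) in UTF-8. The following is only a rough sketch in Python, not part of swish-e; the helper name guess_encoding is made up for illustration, and it assumes the file is either ASCII, UTF-8, or some single-byte encoding such as ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: rough guess at the encoding of a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Latin-1 text with accented letters is almost never valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1 (or another single-byte encoding)"


# The same two words encoded both ways, to show the byte-level difference.
samples = {
    "as utf-8": "españa PESTAÑA".encode("utf-8"),
    "as iso-8859-1": "españa PESTAÑA".encode("iso-8859-1"),
}
for label, data in samples.items():
    print(label, "|", data.hex(" "), "|", guess_encoding(data))
```

To apply it to the real file, something like guess_encoding(open("test.xml", "rb").read()) should do. The guess is a heuristic, so treat the answer as a hint rather than proof.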
   
   
  
  
   
   
   
> On Wed, Feb 16, 2005 at 06:46:41AM -0800, dasoso@alumni.uv.es wrote:
> >  1.- Here are my locale settings. Could they be the reason
> > swish-e indexes ÁRBOL as Árbol?
>    
> Yes, that's what I was suggesting.
>    
> Swish-e is converting your text to 8859-1 encoding, but you are
> telling it to sort using UTF-8.
>    
> Run swish like this:   
>    
>     LANG=es_ES swish-e -c config   
>    
> Maybe a demonstration will make it clear:   
>    
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -i t.txt -T indexed_words -v0
>     Adding:[1:swishdefault(1)]   'pestaÑa'   Pos:5  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'Águila'   Pos:6  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )
> moseley@bumby:~$ LANG=es_ES swish-e -i t.txt -T indexed_words -v0
>     Adding:[1:swishdefault(1)]   'pestaña'   Pos:5  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:6  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )
>    
> And you will need to search that way, too -- or at least be
> consistent, so that your locale setting is the same when indexing
> and when searching and tolower() operates the same way in both
> cases.  But the bottom line is you don't want to tell tolower()
> that it's working with UTF-8 encoding when it's really working
> with 8859-1 encoding.
>    
>    
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -w PESTAÑA -H9 | grep Parsed
> # Parsed Words: pestaÑa    
> moseley@bumby:~$ LANG=es_ES swish-e -w PESTAÑA -H9 | grep Parsed   
> # Parsed Words: pestaña   
>    
>    
> We could force LANG at program startup, but there's more than one
> valid setting (e.g. en_US, de_DE, es_ES), so we want people to be
> able to set that.
>    
>    
>    
>    
> --    
> Bill Moseley   
> moseley@hank.org   
>    
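
The point in the quoted reply about tolower() can be modelled in a few lines. This is a sketch in Python, not swish-e's own code. It assumes the words are held as ISO-8859-1 bytes (as the reply says) and that a tolower() running under a UTF-8 locale leaves every byte above 0x7F untouched, which is consistent with the 'pestaÑa' output shown in the demonstration. The function names are invented for illustration.

```python
def lower_latin1(data: bytes) -> bytes:
    """Model of tolower() under an 8859-1 locale: every byte is one character."""
    return data.decode("iso-8859-1").lower().encode("iso-8859-1")


def lower_ascii_only(data: bytes) -> bytes:
    """Model of tolower() under a UTF-8 locale (assumption): only A-Z change,
    bytes above 0x7F pass through as they are."""
    return data.lower()  # bytes.lower() only touches ASCII letters


word = "PESTAÑA".encode("iso-8859-1")  # the Ñ is the single byte 0xD1

indexed = lower_latin1(word)        # what an index built with LANG=es_ES holds
searched = lower_ascii_only(word)   # what a search under es_ES.UTF-8 looks for

print(indexed.decode("iso-8859-1"))   # pestaña
print(searched.decode("iso-8859-1"))  # pestaÑa
print(indexed == searched)            # False
```

The last line is the practical consequence: if the index was built under one locale and the query is lowercased under the other, the two byte strings differ and the search term does not match the indexed word.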
   
   
Received on Wed Feb 23 11:41:06 2005