Re: problems with tolower

From: <dasoso(at)not-real.alumni.uv.es>
Date: Wed Feb 16 2005 - 14:52:06 GMT
     
  Hi all.     
     
     
> man locale or use Google.  Your machine's locale determines how it
> sorts, displays money and thousands separators in numbers.
     
 1.- Here are my locale settings. Could they be the reason why swish-e
indexes ÁRBOL as Árbol?
     
x:~> locale     
LANG=es_ES.UTF-8     
LC_CTYPE="es_ES.UTF-8"     
LC_NUMERIC="es_ES.UTF-8"     
LC_TIME="es_ES.UTF-8"     
LC_COLLATE="es_ES.UTF-8"     
LC_MONETARY="es_ES.UTF-8"     
LC_MESSAGES="es_ES.UTF-8"     
LC_PAPER="es_ES.UTF-8"     
LC_NAME="es_ES.UTF-8"     
LC_ADDRESS="es_ES.UTF-8"     
LC_TELEPHONE="es_ES.UTF-8"     
LC_MEASUREMENT="es_ES.UTF-8"     
LC_IDENTIFICATION="es_ES.UTF-8"     
LC_ALL=     
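
One possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the lowercasing step touches only the ASCII letters A-Z. A minimal Python sketch of such a byte-wise lowercasing over ISO-8859-1 data (the helper name `ascii_only_lower` is hypothetical and is not part of swish-e) reproduces the Árbol result:

```python
def ascii_only_lower(data: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z; every other byte is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


# In ISO-8859-1, 'Á' is the single byte 0xC1, which lies outside the A-Z range.
word = "ÁRBOL".encode("latin-1")
folded = ascii_only_lower(word).decode("latin-1")
print(folded)  # Árbol
```

If something like this is what happens, the accented capital survives because it is not an ASCII letter, while the rest of the word is lowercased.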
     
     
>
> TranslateCharacters is helpful mostly for English speakers where they
> might want to search for Niño but might type Nino instead.  It's
> probably not what you need.
>
     
 2.- OK, but swish-e indexes ÁRBOL as Árbol and árbol as árbol. It
would be useful for me if TranslateCharacters worked, so that swish-e
indexed all of those variants as a single word (arbol), because when
I search for arbol I would like Árbol, ÁRBOL, árbol... to appear in
the results too.
     How can I make it work?
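
To make the desired behaviour concrete, here is a small Python sketch of the folding I have in mind. It only illustrates the result I am after; it is not how swish-e or TranslateCharacters is implemented, and the function name `fold` is hypothetical.

```python
import unicodedata


def fold(word: str) -> str:
    """Strip combining accents and lowercase, so accented variants collapse to one term."""
    decomposed = unicodedata.normalize("NFD", word)  # 'Á' becomes 'A' + combining acute
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


variants = ["arbol", "árbol", "ARBOL", "ÁRBOL", "Árbol"]
print({fold(w) for w in variants})  # {'arbol'}
```

Note that this kind of folding also turns ñ into n, so Niño and Nino would become the same index term.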
     
 Example:   
   
 cat prueba.html   
   
<html>   
<body>   
arbol   
árbol   
ARBOL   
ÁRBOL   
</body>   
</html>   
   
 cat test.xml   
   
<?xml version="1.0" encoding="ISO-8859-1"?>   
<!DOCTYPE order SYSTEM "pedido.dtd">   
<Idioma tipo="Castellano">   
            <descripcion>   
                arbol   
                árbol   
                ÁRBOL   
                ARBOL   
            </descripcion>   
</Idioma>   
   
 cat test2.xml   
<?xml version="1.0" encoding="ISO-8859-1"?>   
<!DOCTYPE order SYSTEM "pedido.dtd">   
<Idioma tipo="Castellano">   
            <descripcion>   
                arbol   
                ARBOL   
            </descripcion>   
</Idioma>   
   
   
swish-e -c swish-e.conf -T indexed_words   
Indexing Data Source: "File-System"   
Indexing "/home/dsorian/parabuscar/kk/paraelmail"   
Checking dir "/home/dsorian/parabuscar/kk/paraelmail"...   
  prueba.html - Using HTML parser -   
    Adding:[1:swishdefault(1)]   'arbol'   Pos:1  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'árbol'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'arbol'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'Árbol'   Pos:4  Stuct:0x9 ( BODY FILE )
 (4 words)   
  test.xml - Using XML2 parser -   
**Adding automatic MetaName 'idioma' found in file /test.xml'   
**Adding automatic MetaName 'idioma.tipo' found in file  
ail/test.xml'   
    Adding:[2:idioma(10)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )   
    Adding:[2:idioma.tipo(11)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )
  
**Adding automatic MetaName 'descripcion' found in file   
'/home/dsorian/parabuscar/kk/paraelmail/test.xml'   
    Adding:[2:idioma(10)]   'árbol'   Pos:6  Stuct:0x1 ( FILE )   
    Adding:[2:descripcion(12)]   'árbol'   Pos:6  Stuct:0x1 ( FILE )   
    Adding:[2:idioma(10)]   'Árbol'   Pos:7  Stuct:0x1 ( FILE )   
    Adding:[2:descripcion(12)]   'Árbol'   Pos:7  Stuct:0x1 ( FILE )   
 (3 words)   
  test2.xml - Using XML2 parser -   
    Adding:[3:idioma(10)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )   
    Adding:[3:idioma.tipo(11)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[3:idioma(10)]   'arbol'   Pos:6  Stuct:0x1 ( FILE )   
    Adding:[3:descripcion(12)]   'arbol'   Pos:6  Stuct:0x1 ( FILE )   
    Adding:[3:idioma(10)]   'arbol'   Pos:7  Stuct:0x1 ( FILE )   
    Adding:[3:descripcion(12)]   'arbol'   Pos:7  Stuct:0x1 ( FILE )   
 (3 words)   
Removing very common words...   
no words removed.   
Writing main index...   
Sorting words ...   
Sorting 4 words alphabetically   
Writing header ...   
Writing index entries ...   
  Writing word text: Complete   
  Writing word hash: Complete   
  Writing word data: Complete   
4 unique words indexed.   
4 properties sorted.   
3 files indexed.  473 total bytes.  16 total words.   
Elapsed time: 00:00:00 CPU time: 00:00:00   
Indexing done!   
   
   
   
 swish-e -k '*' 
# SWISH format: 2.4.3 
index.swish-e: arbol castellano Árbol árbol 
 
I would like Á and á to be translated to a, which would improve my
searches. For the next search I want to get both test.xml and
test2.xml in the results, but only test.xml is returned.
 
dsorian@linux:~/swish-e-2.4.3> swish-e -w "idioma=Árbol" 
# SWISH format: 2.4.3 
# Search words: idioma=Árbol 
# Removed stopwords: 
# Number of hits: 1 
# Search time: 0.001 seconds 
# Run time: 0.026 seconds 
1000 /home/dsorian/parabuscar/kk/paraelmail/test.xml "test.xml" 217 
 
 
Thank you, and sorry for the long mail :)
Received on Wed Feb 16 06:52:13 2005