Re: non-English characters in XML files

From: <dasoso(at)not-real.alumni.uv.es>
Date: Sat Nov 06 2004 - 14:40:26 GMT
> On Sat, Nov 06, 2004 at 04:35:24AM -0800, dasoso@alumni.uv.es wrote:
> >  Adding:[1:descripcion(18)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
> >                                   |
> >                                  diseñar
> >
> >
> >  Ok, I have indexed all words but without non-English chars. But
> > why do you get diseñar indexed while I get disenar?
>   
> Because you said in your config file   
>   
>   TranslateCharacters :ascii7:  
>   
> The point of that is so you can use either diseñar or disenar  
> in your query and find the same word.  
>   
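
To make the intent of that directive concrete, here is a minimal Python sketch of the general idea: fold accented letters to their plain form at both index time and query time, so that either spelling reaches the same entry. It is only an illustration. The function name strip_accents and the toy index are hypothetical, and it says nothing about how Swish-e implements TranslateCharacters internally.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Decompose each character (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy index: words are folded once, when they are added.
indexed_words = {strip_accents(w.lower()) for w in ["diseñar", "diseño", "perro"]}


def matches(query: str) -> bool:
    """Fold the query the same way before looking it up."""
    return strip_accents(query.lower()) in indexed_words


for query in ("diseñar", "disenar"):
    print(query, matches(query))
```

The point of the sketch is that folding has to be applied on both sides. If only the indexed words are folded and the query is not, the accented query cannot match.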
  
No, Bill, it doesn't work.
  
  
I tried without the TranslateCharacters :ascii7: directive in the config file:
  
  
  
  
UndefinedXMLAttributes auto  
UndefinedMetaTags auto  
  
IndexOnly .xml .html .htm  
  
IndexReport 3  
ParserWarnLevel 9  
  
IndexContents XML* .xml  
IndexContents HTML2 .html .htm  
  
WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ  
  
And that's what I get:
  
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -i test.html test.xml -T indexed_words
  
Indexing "test.html"  
  
Checking file "test.html"...  
  test.html - Using HTML2 parser -
    Adding:[1:swishdefault(1)]   'diseño'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'señales'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'niño'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'perro'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'leña'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'piña'   Pos:7  Stuct:0x9 ( BODY FILE )
 (6 words)  
  
  
Indexing "test.xml"  
  
  
  
    Adding:[2:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'dise'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'dise'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'dise'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'dise'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'bases'   Pos:27  Stuct:0x1 ( FILE )  
  
  
  
  
A plain search for diseño returns test.html only.
A search in the asignatura.nombre metaname for diseño or diseno returns nothing:
  
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño'  
# SWISH format: 2.4.2  
# Search words: asignatura.nombre=diseño  
# Removed stopwords:  
err: no results  
  
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseno'  
# SWISH format: 2.4.2  
# Search words: asignatura.nombre=diseno  
# Removed stopwords:  
err: no results  
  
And with the TranslateCharacters :ascii7: directive:
  
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -i test.html test.xml -T indexed_words
  
Indexing Data Source: "File-System"  
Indexing "test.html"  
  
Checking file "test.html"...  
  test.html - Using HTML2 parser -
    Adding:[1:swishdefault(1)]   'disea'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'o'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'sea'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'ales'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'nia'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'o'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'perro'   Pos:8  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'lea'   Pos:9  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'pia'   Pos:10  Stuct:0x9 ( BODY FILE )
 (9 words)  
Indexing "test.xml"  
  
  
  
  
**Adding automatic MetaName 'descripcion' found in file 'test.xml'  
    Adding:[2:idioma(10)]   'blah'   Pos:20  Stuct:0x1 ( FILE )  
    Adding:[2:curso(12)]   'blah'   Pos:20  Stuct:0x1 ( FILE )  
    Adding:[2:asignatura(14)]   'blah'   Pos:20  Stuct:0x1 ( FILE )  
    Adding:[2:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )  
    Adding:[2:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )  
    Adding:[2:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )  
    Adding:[2:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'datos'   Pos:27  Stuct:0x1 ( FILE )  
  
  
  
    Adding:[2:asignatura(14)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:tipo(17)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:descripcion(18)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'segundo'   Pos:42  Stuct:0x1 ( FILE )  
    Adding:[2:curso(12)]   'segundo'   Pos:42  Stuct:0x1 ( FILE )  
    Adding:[2:curso.numero(13)]   'segundo'   Pos:42  S  
  
And the search for asignatura=diseñar doesn't work, although asignatura=disenar does:
 
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=disenar' 
# SWISH format: 2.4.2 
# Search words: asignatura=disenar 
# Removed stopwords: 
# Number of hits: 1 
# Search time: 0.001 seconds 
# Run time: 0.022 seconds 
1000 test.xml "test.xml" 683 
 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=diseñar' 
# SWISH format: 2.4.2 
# Search words: asignatura=diseñar 
# Removed stopwords: 
err: no results 
 
And the same happens for asignatura.nombre=diseno.

So I can't use either diseñar or disenar in the query to find the
same word: only the unaccented form matches.
 
 
Thank you. 
 
David. 
Received on Sat Nov 6 06:40:30 2004