Re: non-English characters in XML files

From: <dasoso(at)not-real.alumni.uv.es>
Date: Tue Nov 09 2004 - 14:53:23 GMT
  
  
  
  
>   
> I think I asked already, but are you really indexing UTF-8? Doesn't
> look like it. If you are telling libxml2 that you are indexing UTF-8
> then it's not going to deal with ñ because it's not a UTF-8 character.
  
Maybe I am doing something wrong with libxml2? How can I check?
I installed libxml2 before swish-e and use IndexContents XML2 in the
config file. Is anything else needed?
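
One way to check what a file really contains, independent of swish-e and libxml2, is to test whether its bytes decode as UTF-8. The sketch below is only an illustration: the helper name looks_like_utf8 is made up for this example. It relies on the fact that ñ is the single byte 0xF1 in ISO-8859-1 but the two bytes 0xC3 0xB1 in UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


enye = "\u00f1"                        # the letter n-with-tilde
as_latin1 = enye.encode("iso-8859-1")  # single byte 0xF1
as_utf8 = enye.encode("utf-8")         # two bytes 0xC3 0xB1

# A lone 0xF1 byte is not valid UTF-8, while 0xC3 0xB1 is.
print(as_latin1, looks_like_utf8(as_latin1))
print(as_utf8, looks_like_utf8(as_utf8))
```

To try it on a real file, something like looks_like_utf8(open("test.xml", "rb").read()) should work. A pure-ASCII file passes the check under either encoding, so the result only says something when the file contains accented characters.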
 
 
  
  
>   
> If you were to look at swish-e's source code in parser.c you would see
> that the same code is used for processing XML and HTML. So any
> difference is due to the way libxml2 is processing the text. So it
> appears that libxml2 is not using the correct encoding for your xml
> file, but is using an encoding that works for your html file.
>
> Maybe libxml2 assumes 8859-1 for the html.
>
> Now, if you specify 8859-1 for the xml encoding and it still doesn't
> work then maybe it has to do with your locale settings on your machine.
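
The effect of a declared encoding that does not match the bytes can be seen with any XML parser. The sketch below uses Python's standard-library ElementTree (expat), not libxml2, so it only illustrates the general idea; the sample element and the try_parse helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# The same ISO-8859-1 encoded body, with n-with-tilde stored as the byte 0xF1.
latin1_body = '<asignatura nombre="Dise\u00f1o"/>'.encode("iso-8859-1")

declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body


def try_parse(document: bytes) -> str:
    """Parse the document and return the attribute, or the parser error text."""
    try:
        return ET.fromstring(document).get("nombre")
    except ET.ParseError as err:
        return "parse error: %s" % err


result_latin1 = try_parse(declared_latin1)  # declaration matches the bytes
result_utf8 = try_parse(declared_utf8)      # declaration contradicts the bytes

print(result_latin1)
print(result_utf8)
```

With a matching declaration the attribute comes back intact; with the UTF-8 declaration the lone 0xF1 byte is rejected by this parser. How libxml2 reacts in the same situation may differ.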
 
 
I tried ISO-8859-1 as well. Here are both test files and the indexing output:
 
dsorian@linux:~/swish-e-2.4.2> cat test2.xml 
<?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE order SYSTEM "pedido.dtd"> 
<Idioma tipo="Castellano"> 
   <curso numero="quinto"> 
        <asignatura nombre="IPI" codigo="1"> 
            <tipo> Troncal</tipo> 
            <descripcion> Blah.</descripcion> 
        </asignatura> 
 
        <asignatura nombre="Diseño de bases de datos" codigo="4">
            <tipo> Optativa</tipo>
            <descripcion> Diseñar.</descripcion>
        </asignatura> 
   </curso> 
 
   <curso numero="segundo"> 
        <asignatura nombre="Base de datos" codigo="2"> 
            <tipo> Troncal </tipo> 
            <descripcion> </descripcion> 
        </asignatura> 
   </curso> 
 
</Idioma> 
 
 
dsorian@linux:~/swish-e-2.4.2> cat test.xml
<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE order SYSTEM "pedido.dtd"> 
<Idioma tipo="Castellano"> 
   <curso numero="quinto"> 
        <asignatura nombre="IPI" codigo="1"> 
            <tipo> Troncal</tipo> 
            <descripcion> Blah.</descripcion> 
        </asignatura> 
 
        <asignatura nombre="Diseño de bases de datos" codigo="4">
            <tipo> Optativa</tipo>
            <descripcion> Diseñar.</descripcion>
        </asignatura> 
   </curso> 
 
   <curso numero="segundo"> 
        <asignatura nombre="Base de datos" codigo="2"> 
            <tipo> Troncal </tipo> 
            <descripcion> </descripcion> 
        </asignatura> 
   </curso> 
 
</Idioma> 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T indexed_words
 
 
 
test.xml - Using XML2 parser 
 
 
 
    Adding:[1:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[1:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[1:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[1:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[1:asignatura.nombre(15)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[1:idioma(10)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
 
 
 
    Adding:[1:tipo(17)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[1:idioma(10)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[1:curso(12)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[1:asignatura(14)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[1:descripcion(18)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
 
 
 
test2.xml - Using XML2 parser -
    Adding:[2:idioma(10)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )
 
 
 
    Adding:[2:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'disea'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'disea'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'disea'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'disea'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'o'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'bases'   Pos:27  Stuct:0x1 ( FILE ) 
    Adding:[2:curso(12)]   'bases'   Pos:27  Stuct:0x1 ( FILE ) 
 
 
 
    Adding:[2:tipo(17)]   'optativa'   Pos:34  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'disea'   Pos:37  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'disea'   Pos:37  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'disea'   Pos:37  Stuct:0x1 ( FILE )
    Adding:[2:descripcion(18)]   'disea'   Pos:37  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'ar'   Pos:38  Stuct:0x1 ( FILE ) 
    Adding:[2:curso(12)]   'ar'   Pos:38  Stuct:0x1 ( FILE ) 
    Adding:[2:asignatura(14)]   'ar'   Pos:38  Stuct:0x1 ( FILE ) 
    Adding:[2:descripcion(18)]   'ar'   Pos:38  Stuct:0x1 ( FILE ) 
    Adding:[2:idioma(10)]   'segundo'   Pos:44  Stuct:0x1 ( FILE ) 
 
 
 
 
 
And look at these interesting searches:
 
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseño
# Removed stopwords:
err: no results

dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseñ'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseñ
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.022 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test2.xml "test2.xml" 676

dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=diseñar'
# SWISH format: 2.4.2
# Search words: asignatura=diseñar
# Removed stopwords:
err: no results

dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=diseñ'
# SWISH format: 2.4.2
# Search words: asignatura=diseñ
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.022 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test2.xml "test2.xml" 676
 
 
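With the ISO-8859-1 declaration, swish-e splits the words at the ñ
('disea' + 'o', 'disea' + 'ar'). I prefer the way it works with the
UTF-8 declaration, where the words stay whole ('diseno', 'disenar').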
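
One possible explanation for the 'disea' + 'o' split, which is an assumption and not something confirmed in this thread, is that test2.xml actually holds UTF-8 bytes while declaring ISO-8859-1. The two bytes of ñ would then be read as two separate Latin-1 characters. The sketch below reproduces that pattern with a made-up tokenizer (fold accents to ASCII, split on anything that is not a letter or digit); it is a stand-in for illustration, not swish-e's actual code.

```python
import re
import unicodedata


def fold_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list:
    """Hypothetical tokenizer: fold accents, lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", fold_ascii(text).lower())


word = "dise\u00f1o"                              # "diseno" with n-with-tilde
misread = word.encode("utf-8").decode("latin-1")  # UTF-8 bytes read as Latin-1

tokens_correct = tokenize(word)
tokens_misread = tokenize(misread)

print(tokens_correct)
print(tokens_misread)
```

Under this assumption the correctly decoded word stays whole and the misread one breaks in two, matching the pattern in the indexing output above. If that is what is happening, saving test2.xml so that its bytes really are ISO-8859-1 would be the thing to try.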
 
 
Thank you. 
 
Received on Tue Nov 9 06:53:23 2004