Re: non-English characters in XML files

From: <dasoso(at)not-real.alumni.uv.es>
Date: Sat Nov 06 2004 - 12:37:02 GMT
 
dsorian@linux:~/swish-e-2.4.2> head -1 test.xml 
<?xml version="1.0" encoding="UTF-8"?> 
 
 
dsorian@linux:~/swish-e-2.4.2>swish-e -c swish-e.conf -i test.html test.xml -T indexed_words 
 
 
Adding:[1:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE ) 
Adding:[1:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
Adding:[1:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
Adding:[1:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
Adding:[1:asignatura.nombre(15)]  'diseno' Pos:25  Stuct:0x1( FILE ) 
  .                                 | 
  .                                 diseño 
  . 
 Adding:[1:tipo(17)]   'optativa'   Pos:33  Stuct:0x1 ( FILE ) 
    Adding:[1:idioma(10)]   'disenar'   Pos:36  Stuct:0x1 ( FILE ) 
    Adding:[1:curso(12)]   'disenar'   Pos:36  Stuct:0x1 ( FILE ) 
    Adding:[1:asignatura(14)]   'disenar'   Pos:36  Stuct:0x1 ( FILE ) 
 Adding:[1:descripcion(18)]   'disenar'   Pos:36  Stuct:0x1 ( FILE ) 
                                  | 
                                 diseñar 
 
 
 OK, all the words are indexed now, but without the non-English 
characters. Why does your index contain 'diseñar' while mine contains 
'disenar'? My libxml2 is libxml2-2.6.11-1. 
 
 If ñ cannot be indexed because of libxml2's UTF-8 conversion, I'll 
translate the users' queries instead. If they search for, say, 
'asignatura=diseño', I'll rewrite it as 'asignatura=diseno'. I think 
that is the only way for me to make it work. 
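
 The query translation described above could look roughly like the 
following Python sketch. It is only an illustration, not part of 
swish-e: the function name fold_accents is hypothetical, and it 
assumes the query arrives as a Unicode string and that dropping the 
combining marks (ñ -> n, á -> a) matches what ended up in the index. 

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper).

    NFD normalization splits 'ñ' into 'n' plus a combining tilde;
    the combining characters are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Rewrite a user query before handing it to the search engine.
    print(fold_accents("asignatura=diseño"))
```

 The same folding would have to be applied to every query term, so 
that searches and index agree on the unaccented form. 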
 
 
Thank you.  
 
> I think that's why you were not able to search for some words -- they 
> were never indexed by swish.  In my case, libxml2 thinks it's indexing 
> UTF-8 -- but ñ is not a valid UTF-8 character so it stops. 
>  
> But if I do this: 
>  
> moseley@bumby:~$ head -1 test.xml   
> <?xml version="1.0" encoding="iso-8859-1" ?> 
>  
> moseley@bumby:~$ swish-e -c c -i test.xml -T indexed_words  -v0 -T indexed_words 
>     Adding:[1:swishdefault(1)]   'troncal'   Pos:6  Stuct:0x1 ( FILE ) 
>     Adding:[1:swishdefault(1)]   'blah'   Pos:9  Stuct:0x1 ( FILE ) 
>     Adding:[1:swishdefault(1)]   'optativa'   Pos:14  Stuct:0x1 ( FILE ) 
>     Adding:[1:swishdefault(1)]   'diseñar'   Pos:17  Stuct:0x1 ( FILE ) 
>     Adding:[1:swishdefault(1)]   'obligatoria'   Pos:24  Stuct:0x1 ( FILE ) 
>  
>  
> Now, for some reason you are not having that problem: 
>  
> >     Adding:[2:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )  
> >     Adding:[2:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )  
> >     Adding:[2:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1  
>  
> So are you saying that you cannot search for those words? 
>  
> >  I want to know if the non-English chars can be indexed correctly in  
> > the XML files.  
>  
> Libxml2 converts to UTF-8 internally then swish converts to 8859-1 
> before indexing.  If your "non-English" characters can pass that 
> conversion, then yes. 
>  
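
 The two encoding points in the quoted reply can be checked outside 
swish-e with a small Python sketch. This is an illustration under 
assumptions, not how libxml2 or swish-e implement it: the helper 
names is_valid_utf8 and fits_latin1 are made up here. The first shows 
that ñ stored as a single ISO-8859-1 byte is not a valid UTF-8 
sequence; the second tests whether a string survives conversion to 
ISO-8859-1. 

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the raw bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def fits_latin1(text: str) -> bool:
    """True if every character can be represented in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # 'ñ' is the single byte 0xF1 in ISO-8859-1 but two bytes in UTF-8.
    latin1_bytes = "diseñar".encode("iso-8859-1")
    utf8_bytes = "diseñar".encode("utf-8")
    print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
    # Spanish accented letters fit in ISO-8859-1; the euro sign does not.
    print(fits_latin1("diseño"), fits_latin1("€"))
```

 A file whose bytes fail the first check while its XML declaration 
says UTF-8 is mislabeled; declaring the encoding it is really saved 
in, as in the iso-8859-1 example above, is one way around that. 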
 
 
 
Received on Sat Nov 6 04:37:06 2004