Re: non-English characters in XML files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Nov 04 2004 - 19:03:20 GMT
> I tried it but I don't see anything odd. Here's what I get. It seems 
> that every word is indexed. The only problem appears with diseño, 
> which is indexed as diseno. 
> I have ParserWarnLevel 9 in the config file.

Well, your results look slightly different.

Notice how the parser stops processing in my case:

moseley@bumby:~$ cat test.xml
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
   <curso numero="quinto">
        <asignatura nombre="IPI" codigo="1">
            <tipo> Troncal</tipo>
            <descripcion> Blah.</descripcion>
        </asignatura>

        <asignatura nombre="Diseño de bases de datos" codigo="4">
            <tipo> Optativa</tipo>
            <descripcion> Diseñar.</descripcion>
        </asignatura>
   </curso>

   <curso numero="segundo">
        <asignatura nombre="Base de datos" codigo="2">
            <tipo> Obligatoria </tipo>
            <descripcion> </descripcion>
        </asignatura>
   </curso>

</Idioma>

moseley@bumby:~$ cat c
DefaultContents XML*
ParserWarnLevel 9

moseley@bumby:~$ swish-e -c c -i test.xml -T indexed_words  -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'troncal'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'blah'   Pos:9  Stuct:0x1 ( FILE )
test.xml:10: error: Input is not proper UTF-8, indicate encoding !
        <asignatura nombre="Diseño de bases de datos" codigo="4">
                                ^
test.xml:10: error: Bytes: 0xF1 0x6F 0x20 0x64
        <asignatura nombre="Diseño de bases de datos" codigo="4">


I think that's why you were not able to search for some words -- they
were never indexed by swish.  In my case the file declares no encoding,
so libxml2 assumes UTF-8 -- but the byte 0xF1 (ñ in ISO-8859-1) followed
by "o" is not a valid UTF-8 sequence, so the parser stops there.
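
To see why those four bytes are rejected, here is a minimal sketch in Python (not part of swish-e or libxml2; the helper name try_decode is made up for illustration). It decodes the bytes from the error message once as UTF-8 and once as ISO-8859-1:

```python
from typing import Optional

# The four bytes libxml2 printed in its error: 0xF1 0x6F 0x20 0x64
raw = bytes([0xF1, 0x6F, 0x20, 0x64])


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Hypothetical helper: decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# In UTF-8, 0xF1 opens a multi-byte sequence and must be followed by
# continuation bytes (0x80-0xBF); 0x6F ("o") is not one, so decoding fails.
as_utf8 = try_decode(raw, "utf-8")

# In ISO-8859-1 every byte is one character, and 0xF1 is "ñ".
as_latin1 = try_decode(raw, "iso-8859-1")

print("utf-8     :", as_utf8)
print("iso-8859-1:", as_latin1)
```

The same bytes are fine as ISO-8859-1 and invalid as UTF-8, which is consistent with the parser giving up at that position.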

But if I do this:

moseley@bumby:~$ head -1 test.xml  
<?xml version="1.0" encoding="iso-8859-1" ?>

moseley@bumby:~$ swish-e -c c -i test.xml -T indexed_words  -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'troncal'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'blah'   Pos:9  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'optativa'   Pos:14  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'diseñar'   Pos:17  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'obligatoria'   Pos:24  Stuct:0x1 ( FILE )
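
The effect of the encoding declaration can be reproduced outside swish-e. This is only a sketch: it uses Python's xml.etree.ElementTree (which is built on expat, not libxml2) as a stand-in parser, a cut-down one-element document instead of the full test.xml, and a made-up helper name, parse_ok.

```python
import xml.etree.ElementTree as ET

# One element from test.xml, with the ñ stored as the single byte 0xF1,
# the way an ISO-8859-1 editor would save it.
BODY = b'<asignatura nombre="Dise\xf1o de bases de datos" codigo="4"/>'

# Same body, two different XML declarations.
without_decl = b'<?xml version="1.0" standalone="no" ?>\n' + BODY
with_decl = b'<?xml version="1.0" encoding="iso-8859-1" ?>\n' + BODY


def parse_ok(document: bytes) -> bool:
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("no encoding declared:", parse_ok(without_decl))
print("iso-8859-1 declared :", parse_ok(with_decl))

# With the declaration in place the attribute comes back intact.
print(ET.fromstring(with_decl).get("nombre"))
```

Without the declaration the parser has to assume UTF-8 and rejects the document; with it, the same bytes parse cleanly.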


Now, for some reason you are not having that problem:

>     Adding:[2:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
>     Adding:[2:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
>     Adding:[2:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 

So are you saying that you cannot search for those words?

>  I want to know if the non-English chars can be indexed correctly in 
> the XML files. 

libxml2 converts the input to UTF-8 internally, then swish converts it
to ISO-8859-1 before indexing.  If your "non-English" characters survive
that conversion (that is, they exist in ISO-8859-1), then yes.
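
A rough way to check in advance which words would survive that second conversion is to try encoding them as ISO-8859-1. The sketch below is an assumption-laden illustration in Python: the function name and sample words are made up, and it says nothing about what swish-e itself does with characters that do not convert.

```python
def survives_latin1(word: str) -> bool:
    """Hypothetical check: True if every character exists in ISO-8859-1."""
    try:
        word.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Sample words chosen for illustration only.
samples = ["dise\u00f1o", "espa\u00f1ol", "\u20acuro", "\u65e5\u672c\u8a9e"]

for word in samples:
    status = "ok" if survives_latin1(word) else "not representable"
    print(f"{word!r}: {status}")
```

Spanish text such as diseño fits in ISO-8859-1; the euro sign and Japanese characters do not, so words containing them would not pass the conversion unchanged.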

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Thu Nov 4 11:03:21 2004