> I tried it but I don't see anything odd; here's what I get. It seems
> that every word is indexed. The only problem appears with diseño,
> which is indexed as diseno:
> I have ParserWarnLevel 9 in the config file.
Well, your results look slightly different.
Notice how the parser stops processing in my case:
moseley@bumby:~$ cat test.xml
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
<curso numero="quinto">
<asignatura nombre="IPI" codigo="1">
<tipo> Troncal</tipo>
<descripcion> Blah.</descripcion>
</asignatura>
<asignatura nombre="Diseño de bases de datos" codigo="4">
<tipo> Optativa</tipo>
<descripcion> Diseñar.</descripcion>
</asignatura>
</curso>
<curso numero="segundo">
<asignatura nombre="Base de datos" codigo="2">
<tipo> Obligatoria </tipo>
<descripcion> </descripcion>
</asignatura>
</curso>
</Idioma>
moseley@bumby:~$ cat c
DefaultContents XML*
ParserWarnLevel 9
moseley@bumby:~$ swish-e -c c -i test.xml -T indexed_words -v0 -T indexed_words
Adding:[1:swishdefault(1)] 'troncal' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'blah' Pos:9 Stuct:0x1 ( FILE )
test.xml:10: error: Input is not proper UTF-8, indicate encoding !
<asignatura nombre="Diseño de bases de datos" codigo="4">
^
test.xml:10: error: Bytes: 0xF1 0x6F 0x20 0x64
<asignatura nombre="Diseño de bases de datos" codigo="4">
I think that's why you were not able to search for some words -- they
were never indexed by swish. In my case, no encoding is declared, so
libxml2 assumes the file is UTF-8 -- but the byte 0xF1 (ñ in ISO-8859-1)
followed by "o" is not a valid UTF-8 sequence, so parsing stops.
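
To see why those four bytes are rejected, here is a small Python sketch
(not part of swish-e or libxml2; the helper name is made up for this
example). It takes the bytes from the error message and tries them as
UTF-8 and as ISO-8859-1:

```python
# Bytes reported in the libxml2 error message: 0xF1 0x6F 0x20 0x64
raw = bytes([0xF1, 0x6F, 0x20, 0x64])


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xF1 would start a multi-byte UTF-8 sequence, but 0x6F ("o") is not a
# continuation byte, so the sequence is invalid.
utf8_ok = is_valid_utf8(raw)

# Every byte is a valid ISO-8859-1 character, so this decode cannot fail.
as_latin1 = raw.decode("iso-8859-1")

print(utf8_ok, repr(as_latin1))
```

Read as ISO-8859-1, the same bytes are simply the "ño d" from "Diseño de".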
But if I do this:
moseley@bumby:~$ head -1 test.xml
<?xml version="1.0" encoding="iso-8859-1" ?>
moseley@bumby:~$ swish-e -c c -i test.xml -T indexed_words -v0 -T indexed_words
Adding:[1:swishdefault(1)] 'troncal' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'blah' Pos:9 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'optativa' Pos:14 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'diseñar' Pos:14 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'obligatoria' Pos:24 Stuct:0x1 ( FILE )
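
The same effect can be reproduced outside swish-e. The sketch below is
only an illustration: it uses Python's standard-library XML parser
(expat), not libxml2, and try_parse is a name invented for this example.
It parses one element from the test file with and without the encoding
declaration:

```python
import xml.etree.ElementTree as ET

# One element from test.xml, with the ñ stored as the single byte 0xF1,
# the way an ISO-8859-1 editor would save it.
BODY = b'<asignatura nombre="Dise\xf1o de bases de datos" codigo="4"/>'


def try_parse(declaration: bytes):
    """Hypothetical helper: return the nombre attribute, or None on a parse error."""
    try:
        root = ET.fromstring(declaration + BODY)
        return root.get("nombre")
    except ET.ParseError:
        return None


# No encoding declared: the parser assumes UTF-8 and rejects the 0xF1 byte.
without_encoding = try_parse(b'<?xml version="1.0"?>')

# Encoding declared: the byte is read as ñ and parsing succeeds.
with_encoding = try_parse(b'<?xml version="1.0" encoding="iso-8859-1"?>')

print(without_encoding, with_encoding)
```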
Now, for some reason you are not having that problem:
> Adding:[2:idioma(10)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:curso(12)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:asignatura(14)] 'diseno' Pos:25 Stuct:0x1
So are you saying that you cannot search for those words?
> I want to know if the non-English chars can be indexed correctly in
> the XML files.
Libxml2 converts to UTF-8 internally, then swish converts to ISO-8859-1
before indexing. If your "non-English" characters can pass that
conversion, then yes.
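
One rough way to check whether a word can pass that conversion is to see
if every character exists in ISO-8859-1. The Python sketch below does
only that; it is not swish-e's conversion code, the function name and
sample words are made up for illustration, and it says nothing about what
swish does with characters that fail:

```python
def survives_latin1(word: str) -> bool:
    """Hypothetical check: True if every character exists in ISO-8859-1."""
    try:
        word.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Sample words chosen for illustration only.
samples = ["diseño", "diseñar", "Łódź", "日本語"]
results = {word: survives_latin1(word) for word in samples}

for word, ok in results.items():
    print(word, "fits in ISO-8859-1" if ok else "does not fit in ISO-8859-1")
```

Spanish text such as "diseño" fits; Polish "Łódź" and Japanese text do
not, since those characters have no ISO-8859-1 code point.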
--
Bill Moseley
moseley@hank.org
Received on Thu Nov 4 11:03:21 2004