>
> I think I asked already, but are you really indexing UTF-8? Doesn't
> look like it. If you are telling libxml2 that you are indexing UTF-8
> then it's not going to deal with ñ because it's not a UTF-8
> character.
Maybe I did something wrong with libxml2? How can I check?
I installed libxml2 before swish-e and use IndexContents XML2 in the
config file. Is anything else needed?
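
One way to check what a file really contains, independent of swish-e and libxml2, is to compare the encoding named in the XML declaration with the bytes themselves. The following is a minimal sketch in Python; the helper names (declared_encoding, is_valid_utf8, report) are made up for illustration and are not part of swish-e or libxml2. Note that pure ASCII passes as UTF-8, and any byte sequence decodes as ISO-8859-1, so only the UTF-8 check can actually fail.

```python
import re


def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration (XML defaults to UTF-8)."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else "UTF-8"


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def report(data: bytes) -> str:
    """Summarise declared encoding versus what the bytes look like."""
    declared = declared_encoding(data)
    has_non_ascii = any(byte > 0x7F for byte in data)
    if not has_non_ascii:
        return f"declared {declared}; bytes are plain ASCII"
    if is_valid_utf8(data):
        return f"declared {declared}; non-ASCII bytes are valid UTF-8"
    return f"declared {declared}; non-ASCII bytes are NOT valid UTF-8"


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_encoding.py test.xml test2.xml
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, "-", report(handle.read()))
```

If a file declared as ISO-8859-1 is reported as valid UTF-8 with non-ASCII bytes, the declaration and the content probably disagree.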
>
> If you were to look at swish-e's source code in parser.c you would see
> that the same code is used for processing XML and HTML. So any
> difference is due to the way libxml2 is processing the text. So it
> appears that libxml2 is not using the correct encoding for your xml
> file, but is using an encoding that works for your html file.
>
> Maybe libxml2 assumes 8859-1 for the html.
>
> Now, if you specify 8859-1 for the xml encoding and it still doesn't
> work then maybe it has to do with the locale settings on your machine.
I tried ISO-8859-1, and look:
dsorian@linux:~/swish-e-2.4.2> cat test2.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
<curso numero="quinto">
<asignatura nombre="IPI" codigo="1">
<tipo> Troncal</tipo>
<descripcion> Blah.</descripcion>
</asignatura>
<asignatura nombre="Diseño de bases de datos" codigo="4">
<tipo> Optativa</tipo>
<descripcion> Diseñar.</descripcion>
</asignatura>
</curso>
<curso numero="segundo">
<asignatura nombre="Base de datos" codigo="2">
<tipo> Troncal </tipo>
<descripcion> </descripcion>
</asignatura>
</curso>
</Idioma>
dsorian@linux:~/swish-e-2.4.2> cat test.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
<curso numero="quinto">
<asignatura nombre="IPI" codigo="1">
<tipo> Troncal</tipo>
<descripcion> Blah.</descripcion>
</asignatura>
<asignatura nombre="Diseño de bases de datos" codigo="4">
<tipo> Optativa</tipo>
<descripcion> Diseñar.</descripcion>
</asignatura>
</curso>
<curso numero="segundo">
<asignatura nombre="Base de datos" codigo="2">
<tipo> Troncal </tipo>
<descripcion> </descripcion>
</asignatura>
</curso>
</Idioma>
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T indexed_words
test.xml - Using XML2 parser
Adding:[1:descripcion(18)] 'blah' Pos:20 Stuct:0x1 ( FILE )
Adding:[1:idioma(10)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:curso(12)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:asignatura(14)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:asignatura.nombre(15)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:idioma(10)] 'bases' Pos:26 Stuct:0x1 ( FILE )
Adding:[1:tipo(17)] 'optativa' Pos:33 Stuct:0x1 ( FILE )
Adding:[1:idioma(10)] 'disenar' Pos:36 Stuct:0x1 ( FILE )
Adding:[1:curso(12)] 'disenar' Pos:36 Stuct:0x1 ( FILE )
Adding:[1:asignatura(14)] 'disenar' Pos:36 Stuct:0x1 ( FILE )
Adding:[1:descripcion(18)] 'disenar' Pos:36 Stuct:0x1 ( FILE )
test2.xml - Using XML2 parser - Adding:[2:idioma(10)] 'castellano' Pos:3 Stuct:0x1 ( FILE )
Adding:[2:descripcion(18)] 'blah' Pos:20 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'disea' Pos:25 Stuct:0x1 ( FILE )
Adding:[2:curso(12)] 'disea' Pos:25 Stuct:0x1 ( FILE )
Adding:[2:asignatura(14)] 'disea' Pos:25 Stuct:0x1 ( FILE )
Adding:[2:asignatura.nombre(15)] 'disea' Pos:25 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'o' Pos:26 Stuct:0x1 ( FILE )
Adding:[2:curso(12)] 'o' Pos:26 Stuct:0x1 ( FILE )
Adding:[2:asignatura(14)] 'o' Pos:26 Stuct:0x1 ( FILE )
Adding:[2:asignatura.nombre(15)] 'o' Pos:26 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'bases' Pos:27 Stuct:0x1 ( FILE )
Adding:[2:curso(12)] 'bases' Pos:27 Stuct:0x1 ( FILE )
Adding:[2:tipo(17)] 'optativa' Pos:34 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'disea' Pos:37 Stuct:0x1 ( FILE )
Adding:[2:curso(12)] 'disea' Pos:37 Stuct:0x1 ( FILE )
Adding:[2:asignatura(14)] 'disea' Pos:37 Stuct:0x1 ( FILE )
Adding:[2:descripcion(18)] 'disea' Pos:37 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'ar' Pos:38 Stuct:0x1 ( FILE )
Adding:[2:curso(12)] 'ar' Pos:38 Stuct:0x1 ( FILE )
Adding:[2:asignatura(14)] 'ar' Pos:38 Stuct:0x1 ( FILE )
Adding:[2:descripcion(18)] 'ar' Pos:38 Stuct:0x1 ( FILE )
Adding:[2:idioma(10)] 'segundo' Pos:44 Stuct:0x1 ( FILE )
And look at these interesting searches:
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseño
# Removed stopwords:
err: no results
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseñ'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseñ
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.022 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test2.xml "test2.xml" 676
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=diseñar'
# SWISH format: 2.4.2
# Search words: asignatura=diseñar
# Removed stopwords:
err: no results
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura=diseñ'
# SWISH format: 2.4.2
# Search words: asignatura=diseñ
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.022 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test2.xml "test2.xml" 676
With the file declared as ISO-8859-1, swish-e splits the words at the ñ.
I prefer the way it works with UTF-8.
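
One possible reading of the output above, not confirmed against swish-e's source: both files print identically with cat, so both may hold the same UTF-8 bytes, and test2.xml would then be UTF-8 content declared as ISO-8859-1. The Python sketch below models only that assumption. The accent folding and the split on non-letters are simplified stand-ins for whatever the indexer really does, and the function names are invented for illustration.

```python
import re
import unicodedata
from typing import List


def fold_and_split(text: str) -> List[str]:
    """Lowercase, drop accents, and split on anything that is not a-z.

    This is a simplified model of an indexer's word handling, not swish-e's code.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z]+", without_marks)


def index_words(raw: bytes, declared: str) -> List[str]:
    """Decode the raw bytes using the declared encoding, then tokenize."""
    return fold_and_split(raw.decode(declared))


# "Diseño" stored as UTF-8: the ñ becomes the two bytes 0xC3 0xB1.
raw = "Dise\u00f1o".encode("utf-8")

# Declared correctly, the ñ folds to a plain n and the word stays whole.
print(index_words(raw, "utf-8"))       # ['diseno']

# Declared as ISO-8859-1, 0xC3 is read as "Ã" (folded to "a") and 0xB1 as "±",
# which is not a letter, so the word breaks in two.
print(index_words(raw, "iso-8859-1"))  # ['disea', 'o']
```

Under this assumption the fix would be to make the declaration and the actual bytes agree, either by declaring UTF-8 or by converting the file to real ISO-8859-1.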
Thank you.
Received on Tue Nov 9 06:53:23 2004