Hi all.
OK Bill, I commented out WordCharacters.
dsorian@linux:~/swish-e-2.4.2> cat swish-e.conf
IndexDir /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk
(test.html and test.xml are the only files in the dir)
UndefinedXMLAttributes auto
UndefinedMetaTags auto
IndexOnly .xml .html .htm
IndexReport 3
ParserWarnLevel 9
IndexContents XML* .xml
IndexContents HTML2 .html .htm
TranslateCharacters :ascii7:
#WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T indexed_words
Adding:[1:descripcion(18)] 'blah' Pos:20 Stuct:0x1 ( FILE )
Adding:[1:idioma(10)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:curso(12)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:asignatura(14)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:asignatura.nombre(15)] 'diseno' Pos:25 Stuct:0x1 ( FILE )
Adding:[1:idioma(10)] 'bases' Pos:26 Stuct:0x1 ( FILE )
test.html - Using HTML2 parser - Adding:[2:swishdefault(1)] 'disea' Pos:2 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'o' Pos:3 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'disea' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'ar' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'sea' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'ales' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'escoa' Pos:8 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'ado' Pos:9 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'matraz' Pos:10 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'nia' Pos:11 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'o' Pos:12 Stuct:0x9 ( BODY FILE )
(11 words)
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T parsed_words
test.xml - Using XML2 parser
White-space found word 'Blah.'
White-space found word 'Dise?' <-- the blank appears as a square character
White-space found word 'de'
White-space found word 'bases'
White-space found word 'de'
White-space found word 'datos'
White-space found word '4'
White-space found word 'Optativa'
White-space found word 'Dise?r.' <--- here too
White-space found word 'segundo'
White-space found word 'Base'
White-space found word 'de'
White-space found word 'datos'
White-space found word '2'
White-space found word 'Troncal'
(17 words)
test.html - Using HTML2 parser - White-space found word 'diseño'
White-space found word 'diseñar'
White-space found word 'señales'
White-space found word 'Escoñado'
White-space found word 'matraz'
White-space found word 'niño'
(11 words)
So the search for diseño in test.html works perfectly thanks to
HTML2.
dsorian@linux:~/swish-e-2.4.2> swish-e -w diseño
# SWISH format: 2.4.2
# Search words: diseño
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.024 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.html "test.html" 78
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseño
# Removed stopwords:
err: no results
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseno'
# SWISH format: 2.4.2
# Search words: asignatura.nombre=diseno
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.023 seconds
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.xml "test.xml" 671
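The index holds 'diseno' for the XML file, which is consistent with accent folding of the kind TranslateCharacters :ascii7: is meant to do. The Python sketch below only illustrates that idea. It is not swish-e's code, and the function name fold_to_ascii is made up for this example.

```python
import unicodedata


def fold_to_ascii(word: str) -> str:
    """Strip diacritics so accented letters compare equal to their base letters."""
    # NFKD splits "ñ" into "n" plus a combining tilde (U+0303).
    decomposed = unicodedata.normalize("NFKD", word)
    # Dropping every non-ASCII code point removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical index-time and query-time words; both pass through the same folding.
indexed = fold_to_ascii("Diseño").lower()
query = fold_to_ascii("diseño").lower()
print(indexed, query, indexed == query)
```

If the same folding is applied to the indexed word and to the query, the two forms compare equal. That is the behaviour I expected from the metaname search for diseño above.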
So it seems I will not have problems searching .html files.
linux:/usr/... # head -1 test.xml
<?xml version="1.0" encoding="UTF-8"?>
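Since the XML file declares UTF-8, it may be relevant that 'ñ' is two bytes in UTF-8 but one byte in Latin-1. The short Python sketch below shows the difference and what the UTF-8 bytes look like if something reads them as Latin-1. Whether this is what happens to the file or to the query depends on the parser and on the terminal's locale, neither of which is shown here, so treat it as an illustration only.

```python
word = "diseño"

# The same word encoded two ways.
utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("latin-1")

print(utf8_bytes, len(utf8_bytes))      # ñ takes two bytes (0xC3 0xB1)
print(latin1_bytes, len(latin1_bytes))  # ñ takes one byte (0xF1)

# Reading the UTF-8 bytes as if they were Latin-1 turns ñ into two characters.
misread = utf8_bytes.decode("latin-1")
print(misread, len(misread))
```

A tool that compares bytes or single-byte characters would not treat the two encodings of the word as equal.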
You said that searches for diseño and diseno should both match, but in
the XML file only diseno does. Why?
Thank you.
David Soriano.
Received on Sun Nov 7 13:03:32 2004