On Wed, Mar 23, 2005 at 10:28:38AM -0800, Carmelo Carchedi wrote:
> I have a typical XML file like this in UTF-8:
>
> Maybe the problem is accented characters.
> If I have accented characters in the <testomassima> tag, I cannot find
> any word (with or without accents) in the XML file.
>
> Why?
> Is it correct to index UTF-8 files?

It's fine. In fact, all documents parsed by libxml2 are held in UTF-8
internally and then converted to an 8-bit encoding (namely ISO-8859-1)
at indexing time.
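
As a rough illustration of that conversion step, here is a minimal Python sketch that reports which characters of a string have an ISO-8859-1 code point and which do not. It is not swish-e code: the function name latin1_report and the sample text are made up for the example, and what swish-e actually does with characters outside 8859-1 is not covered here.

```python
def latin1_report(text):
    """Split text into characters that fit ISO-8859-1 and those that do not."""
    kept, lost = [], []
    for ch in text:
        try:
            # encode() raises UnicodeEncodeError when ch has no 8859-1 code point
            ch.encode("iso-8859-1")
            kept.append(ch)
        except UnicodeEncodeError:
            lost.append(ch)
    return "".join(kept), "".join(lost)


# Hypothetical sample: Italian accented words plus two non-Latin-1 characters
sample = "perch\u00e9 citt\u00e0 \u0153uvre \u20ac"
kept, lost = latin1_report(sample)
print("fits in 8859-1:", kept)
print("outside 8859-1:", lost)
```

Common Western European accented letters fit in 8859-1, so they should survive the conversion; anything reported as outside the charset is worth checking by hand in the index.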

The trick to debugging is to index a single file:

    swish-e -i test.xml -c swish.config -T indexed_words

The -T indexed_words option makes swish display every word that it
indexes. Those are the words that you can search for. Make sure that
the entire document is being indexed -- there are cases where
malformed XML will make libxml2 abort processing in the middle of a
document.
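
One quick way to check for malformed XML before indexing is to run the file through any strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree (which is built on expat, not libxml2, so the error messages will differ from what swish-e reports). The helper name first_xml_error and the sample documents are assumptions for the example; only the <testomassima> tag comes from the original question.

```python
import xml.etree.ElementTree as ET


def first_xml_error(data):
    """Return a description of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # position is a (line, column) tuple pointing at the failure
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


# Hypothetical documents: one well-formed, one with a mismatched closing tag
good = "<doc><testomassima>perch\u00e9</testomassima></doc>".encode("utf-8")
bad = "<doc><testomassima>perch\u00e9</doc>".encode("utf-8")

for name, data in (("good", good), ("bad", bad)):
    problem = first_xml_error(data)
    print(name, "->", problem if problem else "well-formed")
```

If a file fails here, text after the reported position is a likely candidate for content that never made it into the index.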

Then, when searching, run:

    swish-e -w foo -H9 | grep Parsed

and that will show you the word(s) swish is searching for in the
index.

The other thing is to set ParserWarnLevel 9 in your config file so
that libxml2 will report any errors in processing.

> Is it better to convert the UTF-8 file to another charset?

Doesn't matter.

--
Bill Moseley
moseley@hank.org
Received on Wed Mar 23 10:41:54 2005