Re: indexing utf-8 under windows.

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 23 2005 - 18:41:54 GMT
On Wed, Mar 23, 2005 at 10:28:38AM -0800, Carmelo Carchedi wrote:
> I have a typical xml file like this in utf-8:
> 
> Maybe the problem is "accented characters".
> If I have accented characters in the <testomassima> tag, I cannot find
> any word (with or without accent) in the xml file.
> 
> Why? 
> Is it correct to index utf8 files?

It's fine.  In fact all documents parsed by libxml2 are in UTF-8
internally and are then converted to an 8-bit encoding (namely
ISO-8859-1) at indexing time.
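
As a rough illustration of that conversion step (this is plain Python,
not swish-e's own code, and the helper names to_latin1 and fits_latin1
are made up for the example), the sketch below re-encodes UTF-8 text as
ISO-8859-1 and checks whether a word has an 8-bit form at all:

```python
# Illustration only: this shows Python's codec behaviour, not swish-e internals.

def to_latin1(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 input and re-encode it as ISO-8859-1."""
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1")


def fits_latin1(word: str) -> bool:
    """Return True if every character of `word` exists in ISO-8859-1."""
    try:
        word.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    sample = "città".encode("utf-8")
    # The accented letter takes two bytes in UTF-8 and one in ISO-8859-1.
    print(len(sample), len(to_latin1(sample)))  # 6 5
    print(fits_latin1("città"))
    print(fits_latin1("Łódź"))
```

Accented Western European letters such as "à" have a single-byte
ISO-8859-1 form; a character such as "Ł" does not.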

The trick to debugging is to index a single file:

   swish-e -i test.xml -c swish.config -T indexed_words

That -T indexed_words option will have swish display all the words
that are indexed.  Those are the words that you can search for.  Make
sure that the entire document is being indexed -- there are cases
where bad XML will make libxml2 abort processing in the middle of a
document.
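
One quick way to rule out malformed XML before indexing is to run the
file through any XML parser.  A minimal sketch, assuming Python's
built-in parser (expat, not libxml2, so the messages will differ from
what swish-e prints); check_well_formed is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes: bytes):
    """Return None if the XML parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_bytes)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))


if __name__ == "__main__":
    good = "<doc><testomassima>città</testomassima></doc>".encode("utf-8")
    # Closing tag missing for <testomassima>.
    unclosed = b"<doc><testomassima>citta</doc>"
    # A raw ISO-8859-1 byte in a file that is read as UTF-8.
    wrong_bytes = b"<doc><testomassima>citt\xe0</testomassima></doc>"

    for name, data in [("good", good), ("unclosed", unclosed),
                       ("wrong_bytes", wrong_bytes)]:
        print(name, check_well_formed(data))
```

If this reports an error at some line, compare it with where the
-T indexed_words output stops.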

Then when searching do:

   swish-e -w foo -H9 | grep Parsed

and that will show you the word(s) swish is searching for in the
index.

The other thing is to set ParserWarnLevel 9 in your config file so
that libxml2 will report any errors in processing.


> Is it better to convert the utf-8 file to another charset?

Doesn't matter.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Mar 23 10:41:54 2005