On Fri, Oct 21, 2005 at 01:54:31AM -0700, Nikolay A. Panov wrote:
> Hi Johan,
>
> My Cyrillic docs were perfectly indexed by swish-e on a Linux system.
> I do not use libxml2 (my docs were indexed as TEXT only), since libxml2
> unfortunately cannot work with Cyrillic charsets (koi8-r, cp1251, etc) now.
> Furthermore, I use stemming_ru for morphology-independent searching...

It's not a problem with libxml2, it's a problem with swish. libxml2
uses UTF-8 internally, while swish-e only handles 8-bit encodings, so
as an ugly hack (until swish can be rewritten) swish blindly converts
the UTF-8 it gets from libxml2 into ISO-8859-1.
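
To see why that conversion cannot carry Cyrillic text, here is a small
Python sketch. It is an illustration only, not swish-e code: the helper
name fits_latin1 is made up, and what swish-e actually does with
characters that do not fit is not covered here.

```python
# Illustration only: which text survives a UTF-8 -> ISO-8859-1 conversion?

def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 byte strings, standing in for what a UTF-8 based parser hands over.
utf8_french = "café".encode("utf-8")
utf8_russian = "привет".encode("utf-8")

print(fits_latin1(utf8_french.decode("utf-8")))   # Western European text fits
print(fits_latin1(utf8_russian.decode("utf-8")))  # Cyrillic has no ISO-8859-1 code points
```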

Without libxml2, swish just assumes the input data is in an 8-bit
encoding and takes whatever data it is given.
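
"Takes whatever data it is given" means the bytes carry no label saying
which 8-bit encoding they are in. A short Python sketch of the
consequence (nothing here calls swish-e; the sample word is arbitrary):

```python
# The same 8-bit bytes read as different letters under different charsets.

raw = "привет".encode("koi8-r")   # bytes as they might sit in a koi8-r file

print(len(raw))                   # one byte per character
print(raw.decode("koi8-r"))       # read with the right charset
print(raw.decode("cp1251"))       # same bytes, wrong charset: different letters
```

So whatever reads the indexed words or sends queries has to agree on the
encoding the documents were written in.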

Be aware that the non-libxml2 parsers have some problems. They are
mostly minor, but those parsers may not index as "correctly" as the
libxml2 parser. I don't remember exactly what the differences are any
more. It's a fun exercise to index a few docs using both parsers and
then compare the words indexed.
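
One way to do that comparison, sketched in Python. It assumes you have
already dumped the words from each index into a list (how you dump them
is up to you); compare_word_lists and the sample words are made up for
the example.

```python
# Report the words that only one of two parsers produced.

def compare_word_lists(words_a, words_b):
    """Return (only_in_a, only_in_b), each as a sorted list."""
    a, b = set(words_a), set(words_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical word lists, one per parser.
from_libxml2 = ["index", "parser", "swish", "utf"]
from_builtin = ["index", "parser", "swish", "utf8"]

only_libxml2, only_builtin = compare_word_lists(from_libxml2, from_builtin)
print("only with libxml2:", only_libxml2)
print("only without libxml2:", only_builtin)
```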

Another approach would be to hack parser.c and replace the ISO-8859-1
conversion with one that converts to whatever 8-bit encoding you need
to work with.
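
The idea of that hack, sketched in Python rather than C. This is not the
code in parser.c; utf8_to_8bit is a made-up name, and replacing
unmappable characters with "?" is just one possible policy.

```python
# Convert UTF-8 input to a chosen 8-bit encoding instead of ISO-8859-1.

def utf8_to_8bit(data: bytes, encoding: str = "koi8-r") -> bytes:
    """Decode UTF-8 bytes and re-encode them in the target 8-bit charset.

    Characters the target charset lacks become b"?" (errors="replace").
    """
    return data.decode("utf-8").encode(encoding, errors="replace")


utf8_word = "привет".encode("utf-8")      # 12 bytes in UTF-8
koi8_word = utf8_to_8bit(utf8_word)       # 6 bytes in koi8-r

print(len(utf8_word), len(koi8_word))
print(koi8_word.decode("koi8-r"))
```

The target encoding would have to match the one used for queries and
for any stemmer in play.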

You might also want to check that sorting works the way you expect.
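
A reason sorting deserves a check: in koi8-r the byte values of the
letters do not follow the Russian alphabet, so a plain byte-wise sort
gives a different order. The Python sketch below assumes byte-wise
comparison; whether swish-e sorts that way is an assumption here, and
the sample words are arbitrary.

```python
# Compare alphabetical order with byte-wise order in two 8-bit charsets.

words = ["абрикос", "вишня", "груша"]   # already in alphabetical order

by_koi8 = sorted(words, key=lambda w: w.encode("koi8-r"))
by_cp1251 = sorted(words, key=lambda w: w.encode("cp1251"))

print("alphabetical:", words)
print("koi8-r bytes:", by_koi8)     # "груша" moves ahead of "вишня"
print("cp1251 bytes:", by_cp1251)   # same as alphabetical for these words
```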
--
Bill Moseley
moseley@hank.org
Received on Fri Oct 21 07:21:13 2005