Re: [Fwd: SUBSCRIBE SWISH-E JOHANGRU@KTH.SE]

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 21 2005 - 14:21:04 GMT
On Fri, Oct 21, 2005 at 01:54:31AM -0700, Nikolay A. Panov wrote:
> Hi Johan,
> 
> My cyrillic docs was perfectly indexed by swish-e on Linux system.
> I do not use libxml2 (my docs was indexed as TEXT only), since libxml
> unfortunately cannot work with cyrillic charset (koi8-r, cp1251, etc) now.
> Furthermore, I use stemming_ru for morphology-independent searching...

It's not a problem with libxml2; it's a problem with swish.  libxml2
uses UTF-8 internally, and swish-e only handles 8-bit encodings, so as an
ugly hack (until swish can be rewritten) swish blindly converts UTF-8
into ISO-8859-1.
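
To see why that conversion hurts Cyrillic text, here is a minimal Python
sketch (not swish code; swish itself is C). The helper name
utf8_to_latin1 is made up for illustration, and replacing unmappable
characters with '?' is an assumption; what swish actually emits for
them is not stated here.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8 and re-encode as ISO-8859-1.

    Characters with no ISO-8859-1 equivalent become '?'
    (an assumption made for this illustration).
    """
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")

# A Western European word fits in ISO-8859-1 and survives.
western = utf8_to_latin1("caf\u00e9".encode("utf-8"))

# A Cyrillic word has no ISO-8859-1 representation at all.
cyrillic = utf8_to_latin1("поиск".encode("utf-8"))

print(western)
print(cyrillic)
```

The Western word comes through intact, while every Cyrillic letter is
lost, which is why the problem shows up for koi8-r or cp1251 users.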

When not using libxml2, swish just assumes the input data is in an 8-bit
encoding and takes whatever data it is given.

Be aware that the non-libxml2 parsers have some problems.  They are mostly
minor, but those parsers may not index as "correctly" as the libxml2
parser.  I don't remember any more exactly what the differences are.  It's
a fun exercise to index a few docs using both parsers and then compare the
words indexed.
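
One way to do that comparison is sketched below in Python. It assumes you
have already dumped the words from each index into a plain file, one word
per line; how you produce that dump is up to you. The file names and the
helper names load_words and compare_word_sets are hypothetical.

```python
from pathlib import Path

def load_words(path):
    """Read a one-word-per-line file into a set.

    latin-1 is used only because it maps every byte to a character,
    so any 8-bit data loads without decode errors.
    """
    lines = Path(path).read_text(encoding="latin-1").splitlines()
    return {line.strip() for line in lines if line.strip()}

def compare_word_sets(a, b):
    """Return (words only in a, words only in b), each sorted."""
    return sorted(a - b), sorted(b - a)

# Hypothetical usage with two dumps made beforehand:
#   only_xml, only_txt = compare_word_sets(
#       load_words("words_libxml2.txt"), load_words("words_text.txt"))
```

Words that appear on only one side are the ones worth looking at.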

Another approach would be to hack parser.c and replace the 8859-1
conversion with one that converts to whatever 8-bit encoding you
need to work with.
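
The idea of such a replacement conversion, shown as a Python sketch rather
than as an actual parser.c patch (in C you would need some conversion
library or table of your own; nothing here is swish code). The function
name transcode_utf8 and the koi8-r default are illustrative assumptions.

```python
def transcode_utf8(data: bytes, target: str = "koi8-r") -> bytes:
    """Convert UTF-8 bytes to a chosen single-byte encoding.

    Characters the target encoding cannot represent become '?'
    (an assumption; a real patch would have to pick its own policy).
    """
    text = data.decode("utf-8")
    return text.encode(target, errors="replace")

utf8_word = "поиск".encode("utf-8")   # 10 bytes in UTF-8
koi8_word = transcode_utf8(utf8_word)  # 5 bytes in koi8-r
cp1251_word = transcode_utf8(utf8_word, "cp1251")
```

Whatever encoding you pick, the search side has to send queries in that
same encoding or the bytes will not match.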

You might also want to check that sorting works like you expect.
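
Here is a small Python sketch of why the order can surprise you. It
assumes the sort is a plain byte-by-byte comparison of the 8-bit data,
which I have not verified for swish. The sample words are arbitrary.

```python
words = ["юг", "аист", "борт"]

# Sort by the raw bytes each word has in a given 8-bit encoding.
by_koi8r = sorted(words, key=lambda w: w.encode("koi8-r"))
by_cp1251 = sorted(words, key=lambda w: w.encode("cp1251"))

print(by_koi8r)
print(by_cp1251)
```

koi8-r does not lay out its letters in Russian alphabetical order, so a
byte-wise sort puts "юг" first, while cp1251 bytes give the order you
would expect from a dictionary.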

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Fri Oct 21 07:21:13 2005