Re: Indexing umlauts

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue Dec 13 2005 - 14:34:46 GMT
> What I need is a way to index utf-8 output correctly, or perhaps do a
> substitute in the query...but I will admit I do not know what to
> substitute. My unicode/encoding knowledge is very limited.
>
> I always thought perl worked internally with utf and it should
> therefore not be a problem.

Perl does work with utf8 in versions 5.6 and later. See
http://userpage.fu-berlin.de/~ram/pub/pub_jf47htqHHt/perl_unicode_en

However, there are some gotchas when working with utf8 in Perl. If you
are sure that your script is outputting correct utf8 and handing that to
swish-e, keep reading.
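
If you are not sure, one rough way to check is to try decoding the bytes
your script emits as utf8 and see whether that fails. A minimal sketch,
in Python rather than Perl purely for illustration; the function name
is_valid_utf8 and the sample strings are made up for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways (sample data, not from any real index).
utf8_bytes = "Müller".encode("utf-8")        # u-umlaut becomes two bytes
latin1_bytes = "Müller".encode("iso-8859-1")  # u-umlaut becomes one byte

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

A lone 8859-1 umlaut byte is not a legal utf8 sequence, so output that is
really 8859-1 fails this check. Passing it does not prove the text is what
you intended, only that the bytes are well-formed utf8.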

>
> I guess my question would be .. when the script goes through and
> indexes the output of catdoc what determines the character encoding?
> Is there a variable I can set to use utf-8 or convert utf-8 to
> iso-8859-1?

Be aware: swish-e doesn't store the index in utf8 encoding. If you are
using the libxml2 parser, utf8 input is parsed correctly but is converted
to 8-bit ISO-8859-1 inside swish-e before the index is written. So any
characters that have no ISO-8859-1 equivalent are lost in swish-e. See
parser.c for code details.
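
To get a feel for what a utf8 to iso-8859-1 conversion keeps and what it
drops, here is a small sketch. It is a general illustration in Python, not
swish-e's actual code from parser.c; the function name to_latin1, the sample
words, and the choice to replace unmappable characters with "?" are all
assumptions made for the example.

```python
def to_latin1(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8, then re-encode as ISO-8859-1.

    Characters with no ISO-8859-1 equivalent become '?'. That is a
    choice made for this sketch, not a statement about swish-e.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


# German umlauts exist in ISO-8859-1, so they survive as single bytes.
umlaut = to_latin1("Müller".encode("utf-8"))

# Polish L-stroke and z-acute do not exist in ISO-8859-1, so they are lost.
polish = to_latin1("Łódź".encode("utf-8"))

print(umlaut)
print(polish)
```

Under these assumptions the umlaut comes through as one 8859-1 byte, while
characters outside the 8859-1 range cannot be recovered after the conversion.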

This is a FAQ, and utf8 support is an often-requested feature. See
http://swish-e.org/devel/index.html#swish3


-- 
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Tue Dec 13 06:34:47 2005