Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)>
Date: Sat Dec 13 2003 - 14:48:53 GMT
Hi Bill,

What are the chances of implementing the following features officially?

I suggest introducing a new attribute, e.g. TargetCharset, defining the
charset into which all documents are converted and indexed. The default
value should be "iso-8859-1" for backward compatibility.

E.g. if TargetCharset is "Windows-1250" it should look like this:

1) indexer: iconv(Windows-1250, utf-8) instead of UTF8Toisolat1()

2) indexer: setlocale(Windows-1250) on-the-fly

3) search script: setlocale(Windows-1250) on-the-fly
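
To make step 1 concrete, here is a minimal sketch of the conversion using
the POSIX iconv API. The helper name utf8_to_target() is hypothetical and
is not part of swish-e; it only assumes that the parser hands over UTF-8
text and that the iconv on the system knows the target charset name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of swish-e): convert a UTF-8 buffer to
 * target_charset. Returns the number of bytes written to out, or -1 on
 * error.
 */
long utf8_to_target(const char *target_charset,
                    const char *in, size_t in_len,
                    char *out, size_t out_size)
{
    assert(target_charset != NULL && in != NULL && out != NULL);

    /* iconv_open takes the destination encoding first, then the source. */
    iconv_t cd = iconv_open(target_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;              /* charset unknown to this iconv */

    char *in_ptr = (char *)in;  /* iconv wants a non-const char ** */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;  /* invalid input, unmappable char, or out too small */

    return (long)(out_size - out_left);
}
```

With TargetCharset "Windows-1250" the indexer would pass that name as
target_charset. Which spellings of a charset name are accepted depends on
the iconv implementation, and characters with no mapping in the target
charset make the call fail.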

Besides all the 8-bit charsets supported that way, there should be one
more possible value (e.g. TargetCharset "as-is"), meaning that documents
should be indexed in exactly the encoding they were originally in.

It would look like this:

1) indexer: iconv(charset_of_the_document_being_indexed, utf-8) instead of

2) indexer: setlocale(charset_of_the_document_being_indexed) on-the-fly

3) search script:
setlocale(charset_provided_as_parameter_of_the_search_script) on-the-fly
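
The selection logic I have in mind could be sketched like this. All names
are hypothetical, since TargetCharset is only a proposal, and the sketch
assumes the parser can report the charset of the document being indexed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; TargetCharset is only a proposal. */
#define DEFAULT_TARGET_CHARSET "iso-8859-1"

/*
 * Pick the charset a document should be converted to before indexing.
 * configured is the TargetCharset setting (NULL or "" if unset);
 * document_charset is what the parser reported for this document.
 */
const char *effective_target_charset(const char *configured,
                                     const char *document_charset)
{
    /* No TargetCharset set: keep the current Latin-1 behaviour. */
    if (configured == NULL || configured[0] == '\0')
        return DEFAULT_TARGET_CHARSET;

    /* "as-is": index in the document's own encoding. */
    if (strcmp(configured, "as-is") == 0) {
        /* Only meaningful if the parser reported an encoding. */
        assert(document_charset != NULL);
        return document_charset;
    }

    /* Any other value names one fixed charset for the whole index. */
    return configured;
}
```

With a fixed value the result is the same for every document; with
"as-is" it can differ from one document to the next.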


From: "Bill Moseley" <>
To: "John Angel" <>
Sent: Thursday, December 11, 2003 21:20
Subject: Re: [SWISH-E] Re: Fw: Re: 8-bit chars

> On Thu, Dec 11, 2003 at 11:23:00AM -0800, John Angel wrote:
> > > You are free to modify parser.c to use iconv and convert back to
> > > Windows-1250, as I suggested.  But that won't work for everyone else.
> > Is it possible to use iconv(charset_of_the_document_being_indexed,
> > instead of UTF8Toisolat1()?
> You mean convert from libxml2's internal utf-8 back to the encoding of
> the original document?  Probably -- I assume there's some way to have
> libxml2 tell you what it was encoding from.
> But that would not work if you have documents of different encodings.
> The index itself has to be one encoding.  That's why I was saying that
> iconv could be used with a configuration setting to say what 8-bit
> encoding to use.
> > > What tolower does depends on the tolower
> > > function swish-e was linked with.
> >
> > setlocale(charset_of_the_document_being_indexed) on-the-fly?
> Well, you want tolower to work for the encoding that the index is
> encoded in.
> Bill Moseley
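
To illustrate the tolower point, here is a minimal sketch of lowercasing
an 8-bit buffer under a chosen locale. The helper name
lowercase_in_locale() is hypothetical and is not part of swish-e. Note
that setlocale() takes a locale name rather than a bare charset name, so
"setlocale(Windows-1250)" above is shorthand for a locale that uses that
charset, and such a locale has to be installed on the system.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: lowercase an 8-bit buffer in place using the
 * LC_CTYPE rules of the given locale. Returns 0 on success, -1 if the
 * locale could not be selected. The locale change is process-wide and
 * is not restored here.
 */
int lowercase_in_locale(const char *locale_name,
                        unsigned char *buf, size_t len)
{
    assert(locale_name != NULL && buf != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* tolower() consults the LC_CTYPE selected above for each byte. */
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)tolower(buf[i]);

    return 0;
}
```

In the "C" locale only A-Z are lowercased, so a byte such as 0xC8 (capital
C with caron in Windows-1250) is left unchanged. It would only be folded
under a locale whose charset matches the encoding of the index.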
