Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Dec 13 2003 - 20:08:57 GMT
On Sat, Dec 13, 2003 at 06:48:04AM -0800, John Angel wrote:
> Hi Bill,
> 
> What are the chances of implementing the following features officially?

Probably not for a while.  Weeks or longer.

> I suggest introducing a new attribute, e.g. TargetCharset, defining the
> charset to which all documents will be converted before indexing. The
> default value should be "iso-8859-1" for backward compatibility.

Yes, that's been on my todo list for a long time.  Adding iconv
support to parser.c by itself would not be too hard.  The issue is all
the other code that would have to change along with it.  I'm also aware
that it's the wrong fix: the only correct way would be to convert to
UTF-8 internally, and that's a huge project, as I said before.
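
To make the conversion step concrete, here is a rough sketch in Python
rather than C.  Python's codecs stand in for iconv, the function name
to_target_charset is made up for illustration, and TargetCharset is only
the setting proposed in this thread, not an existing Swish-e directive.

```python
def to_target_charset(raw, source_charset, target_charset="iso-8859-1"):
    """Re-encode a document's bytes from its source charset to the target.

    Characters that the target charset cannot represent are replaced
    with "?" (an assumption; a real implementation might skip or
    transliterate them instead).
    """
    text = raw.decode(source_charset)
    return text.encode(target_charset, errors="replace")


# UTF-8 input re-encoded for an iso-8859-1 index.
utf8_doc = "caf\u00e9".encode("utf-8")           # b"caf\xc3\xa9"
latin1_doc = to_target_charset(utf8_doc, "utf-8")

# A character that iso-8859-1 lacks ("s" with caron) is lost.
cp1250_doc = "\u0161".encode("cp1250")           # b"\x9a"
lossy_doc = to_target_charset(cp1250_doc, "cp1250")
```

The second case shows why a single 8-bit target is only a partial fix:
anything outside the target charset cannot be indexed faithfully.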

> Besides all the 8-bit charsets supported that way, there should be one
> more possible value (e.g. TargetCharset "as-is"), meaning that documents
> should be indexed in exactly the encoding they were originally in.

As I said yesterday, that doesn't make sense.  I tried to explain why I
don't think it can work.  Maybe you can explain in detail how it can
work.  

When searching, Swish-e does not grep individual documents; it
searches a list of indexed words.  To compare the search word with the
words in the index, both have to be in the same encoding.
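
A small illustration of that point, as a Python sketch.  This is a toy
word index keyed on raw bytes, not Swish-e's actual index format, and
build_index is a made-up helper.  The same word stored "as-is" from two
differently encoded documents ends up as two different index entries, so
a query in one encoding cannot find the other document.

```python
def build_index(documents):
    """Map each whitespace-separated word (as raw bytes) to the set of
    document names that contain it.  No charset conversion is done."""
    index = {}
    for name, raw in documents.items():
        for word in raw.split():
            index.setdefault(word, set()).add(name)
    return index


# The same text, indexed "as-is" in two encodings.
text = "caf\u00e9 au lait"
documents = {
    "latin1.txt": text.encode("iso-8859-1"),
    "utf8.txt": text.encode("utf-8"),
}
index = build_index(documents)

# A query typed in iso-8859-1 only matches the iso-8859-1 document.
query = "caf\u00e9".encode("iso-8859-1")
hits_non_ascii = index.get(query, set())

# Pure ASCII words happen to match in both, which hides the problem.
hits_ascii = index.get(b"au", set())
```

Here hits_non_ascii contains only latin1.txt, while hits_ascii contains
both files.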


-- 
Bill Moseley
moseley@hank.org
Received on Sat Dec 13 20:09:03 2003