Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)>
Date: Sun Dec 14 2003 - 08:03:26 GMT
> Yes, that's been on my todo list for a long time.  Just adding iconv
> support to parser.c would not be too hard.  It's all the other stuff
> that goes along with that that's the issue.

What else would need to be modified?

> > Beside all 8-bit charsets supported that way, there should be one more
> > possible value (e.g. TargetCharset "as-is"), suggesting that documents
> > should be indexed exactly in the same encoding as they were originally.
> As I said yesterday, that doesn't make sense.  I tried to explain why I
> don't think it can work.  Maybe you can explain in detail how it can
> work.

It is the same implementation as for a target charset, so I don't see why it
shouldn't be done. It makes a lot of sense when you need to index documents
in different languages and encodings.

E.g. try to index a website that has been translated into several languages
using several encodings. The results may not be perfect, but "as-is"
conversion is the best (and the only) thing we can do.

All other open source engines have similar full 8-bit support.

ht://dig has a "translate_latin1" attribute for conversion to latin1. If it is
set to false, it behaves as I described: "as-is" conversion.
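
To make the proposal concrete, here is a minimal Python sketch of the behaviour I mean. It is not swish-e code and does not use iconv. The function name to_index_bytes is hypothetical, and "as-is" is just the TargetCharset value proposed above. It assumes the source charset of each document is already known.

```python
def to_index_bytes(raw: bytes, source_charset: str, target_charset: str) -> bytes:
    """Return the bytes that would be handed to the indexer (hypothetical helper)."""
    if target_charset == "as-is":
        # Proposed "as-is" mode: no conversion, index the original bytes.
        return raw
    # Normal mode: decode from the source charset, re-encode to the target.
    # Characters the target cannot represent are replaced with "?".
    text = raw.decode(source_charset)
    return text.encode(target_charset, errors="replace")


# Two documents from the same site, stored in different 8-bit encodings.
latin1_doc = "café".encode("iso-8859-1")
cyrillic_doc = "кот".encode("koi8-r")

# "as-is" keeps both byte streams untouched.
print(to_index_bytes(latin1_doc, "iso-8859-1", "as-is"))
print(to_index_bytes(cyrillic_doc, "koi8-r", "as-is"))

# Forcing one 8-bit target charset loses the Cyrillic text entirely.
print(to_index_bytes(cyrillic_doc, "koi8-r", "iso-8859-1"))
```

The last call shows the case I am worried about: with a single 8-bit target charset, documents in another script cannot be represented at all, while "as-is" at least keeps their original bytes searchable.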
Received on Sun Dec 14 08:03:35 2003