Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 11 2003 - 14:22:16 GMT
On Thu, Dec 11, 2003 at 04:02:04AM -0800, John Angel wrote:
> I understand that libxml2 converts everything to utf-8.
> 
> I don't see why we could not convert everything back to original 8-bit using
> other function instead of UTF8Toisolat1()? It seems that we even do not need
> to know what was the original charset.

Correct.  That's why I was suggesting you could use iconv() in parser.c.
It might not be much of a hack to replace the latin1 conversion with
your windows-1250 conversion.
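
As a rough sketch of that idea (not swish-e's actual code), the latin1
call could be swapped for a small iconv() wrapper along these lines.  The
helper name utf8_to_charset() and its signature are made up for
illustration, and the charset name "WINDOWS-1250" is assumed to be one
the local iconv implementation recognizes:

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 text to the 8-bit
 * charset named by `tocode`.  Returns the number of bytes written
 * (excluding the terminating NUL), or -1 on any failure. */
long utf8_to_charset(const char *tocode, const char *in,
                     char *out, size_t outsize)
{
    assert(tocode != NULL && in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                  /* charset unknown to this iconv */

    char  *inptr   = (char *)in;    /* iconv() takes a non-const char ** */
    size_t inleft  = strlen(in);
    char  *outptr  = out;
    size_t outleft = outsize - 1;   /* keep one byte for the NUL */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                  /* bad input or output buffer too small */

    *outptr = '\0';
    return (long)(outptr - out);
}
```

Called as utf8_to_charset("WINDOWS-1250", text, buf, sizeof buf), this
would stand in for the UTF8Toisolat1() call; the target charset would
still have to come from somewhere, such as the config.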

I have thought about using iconv() in the past, but supporting it in a
general way would require other changes to the code (updates to the
config process, the index header format, and the parser), and I haven't
had the time.  I have also thought of it as a work-around rather than the
real fix that using utf-8 internally would be.  It might be easier to do
a complete rewrite than to convert to utf-8, though.

> Regarding tolower(), it should behave the same way - first we convert
> everything to utf-8, then do the tolower_utf8() and then convert everything
> back to 8-bit.

Where's tolower_utf8() defined?  Doing the tolower on the utf-8 is
possible -- but it's not trivial because of the way the input buffer is
managed.  Currently the input text buffer is shared between
properties and text for indexing -- so those buffers would need to be
split (we don't want to tolower() the properties).
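
To illustrate the buffer split, here is a minimal sketch that lowercases
a separate copy of the text for indexing and leaves the original (the
property text) untouched.  utf8_lower_copy() and lower_cp() are
hypothetical names, not existing swish-e or libxml2 functions, and the
case table only covers ASCII, Latin-1 and Latin Extended-A, which is an
assumption about what a windows-1250 corpus would need:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Lowercase one code point.  Only ASCII, Latin-1 and Latin Extended-A
 * are handled; everything else is returned unchanged. */
static unsigned lower_cp(unsigned cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 32;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)   /* skip the times sign */
        return cp + 32;
    if (cp == 0x130)                /* its lowercase is ASCII 'i'; left alone */
        return cp;
    if ((cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177))
        return cp | 1;              /* even = upper, odd = lower */
    if ((cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E))
        return (cp & 1) ? cp + 1 : cp;   /* odd = upper, even = lower */
    return cp;
}

/* Return a malloc'd, lowercased copy of UTF-8 text for indexing.
 * The source buffer is never modified.  Caller frees the result. */
char *utf8_lower_copy(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    unsigned char *dst = (unsigned char *)malloc(len + 1);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len + 1);

    for (size_t i = 0; i < len; ) {
        unsigned char b0 = dst[i];
        if (b0 < 0x80) {
            dst[i] = (unsigned char)lower_cp(b0);
            i += 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF && (dst[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence: decode, map, re-encode in place.  The
             * mapped code point stays in the two-byte range. */
            unsigned cp = lower_cp(((b0 & 0x1Fu) << 6) | (dst[i + 1] & 0x3Fu));
            dst[i]     = (unsigned char)(0xC0 | (cp >> 6));
            dst[i + 1] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            i += 1;   /* longer sequences and stray bytes are copied as-is */
        }
    }
    return (char *)dst;
}
```

The indexing side would work on the returned copy while the property
side keeps the original bytes.  A real implementation would need full
Unicode case mapping rather than this small table.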

> Of course, search script has to know what is the input charset so it can
> properly translate the input to utf8. Checkout the parameters when searching
> using Google - it does the same. This way we can even introduce full utf-8
> support at least for the search script.

What action should swish-e take when converting utf-8 on input and 
there's a conversion failure?
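
One way to frame the choices is as an explicit policy: abort the
document, silently drop the character, or substitute a placeholder.  The
sketch below shows the three options on a hand-written utf-8 to latin1
conversion.  The enum, the function name and the '?' placeholder are all
assumptions for illustration; this is not how swish-e or libxml2
actually behaves:

```c
#include <assert.h>
#include <stddef.h>

enum on_fail { FAIL_ABORT, FAIL_SKIP, FAIL_SUBSTITUTE };

/* Hypothetical converter: UTF-8 to Latin-1 with a failure policy.
 * Returns bytes written (excluding the NUL), or -1 if the policy is
 * FAIL_ABORT and a character fails, or if `out` is too small.
 * *failures counts characters that could not be converted. */
long utf8_to_latin1(const char *in, char *out, size_t outsize,
                    enum on_fail policy, size_t *failures)
{
    assert(in != NULL && out != NULL && failures != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    *failures = 0;
    if (outsize == 0)
        return -1;

    while (*p) {
        unsigned char ch = 0;
        int ok = 0;

        if (*p < 0x80) {                       /* plain ASCII */
            ch = *p++;
            ok = 1;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: the only two-byte range Latin-1 can hold */
            ch = (unsigned char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
            ok = 1;
        } else {
            /* Not representable or malformed: step over the lead byte
             * and any continuation bytes that follow it. */
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
        }

        if (!ok) {
            ++*failures;
            if (policy == FAIL_ABORT)
                return -1;
            if (policy == FAIL_SKIP)
                continue;
            ch = '?';                          /* FAIL_SUBSTITUTE */
        }

        if (n + 1 >= outsize)                  /* leave room for the NUL */
            return -1;
        out[n++] = (char)ch;
    }

    out[n] = '\0';
    return (long)n;
}
```

Whichever policy is picked, the failure count gives the caller something
to log or warn about, so bad input does not disappear without a trace.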


-- 
Bill Moseley
moseley@hank.org
Received on Thu Dec 11 14:23:14 2003