On Thu, Dec 11, 2003 at 07:09:07AM -0800, John Angel wrote:
> The first thing we should do is provide full 8-bit support. Not 7-bit as it
> is now.
It's not 7-bit. It's 8859-1, not ASCII. If you use the HTML parser then
it's basically 8-bit clean, with the exception that tolower() is called on
those 8-bit characters. What tolower() does with them depends on the
tolower function swish-e was linked with.
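To illustrate the tolower point, here is a small Python sketch. It is not swish-e code: Python's bytes.lower() only stands in for a tolower that knows nothing beyond ASCII, and the sample word is made up for the example.

```python
# Byte 0xC9 is "E with acute" (uppercase) in ISO-8859-1; the lowercase form is 0xE9.
word = b"\xc9COLE"  # "ECOLE" with an accented first letter, encoded as ISO-8859-1

# bytes.lower() only changes ASCII letters, so the 8-bit character is left alone.
# This stands in for a tolower() that is unaware of the 8859-1 upper/lower pairs.
ascii_only = word.lower()

# A charset-aware lowercase: decode as 8859-1, lowercase, encode back.
charset_aware = word.decode("iso-8859-1").lower().encode("iso-8859-1")

print(ascii_only)     # first byte is still 0xC9
print(charset_aware)  # first byte is now 0xE9
```

The two results differ only in the first byte, which is exactly the kind of mismatch that makes an indexed word and a query word fail to compare equal.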
> If just one code tweak (iconv instead of UTF8Toisolat1) will give full 8-bit
> support, that should be done immediately.
Whose 8-bit are you talking about? My 8 bits are 8859-1 and it works
fine. You are using a Windows-1250 encoding, which has characters that
do not map to 8859-1.
You are free to modify parser.c to use iconv and convert back to
Windows-1250, as I suggested. But that won't work for everyone else.
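As a rough illustration of the mismatch between the two 8-bit encodings, the Python sketch below uses the standard "cp1250" and "iso-8859-1" codecs. The sample word is an assumption chosen for the example, and none of this is parser.c code.

```python
# U+0161 (s with caron) exists in Windows-1250 but not in ISO-8859-1.
text = "\u0161koda"

# Encoding to Windows-1250 works: U+0161 is byte 0x9A there.
cp1250_bytes = text.encode("cp1250")

# Encoding the same text to 8859-1 fails, because there is no such character.
try:
    text.encode("iso-8859-1")
    maps_to_latin1 = True
except UnicodeEncodeError:
    maps_to_latin1 = False

# Reading the Windows-1250 bytes as if they were 8859-1 does not raise an
# error. It silently yields a different character (0x9A is a control code).
misread = cp1250_bytes.decode("iso-8859-1")

print(cp1250_bytes, maps_to_latin1, repr(misread))
```

So the bytes pass through an 8859-1 pipeline unchanged, but they no longer mean the same character, which is why one person's "8-bit" is not another's.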
> Full utf-8 support is not a joke and certainly requires a lot more things to
> do and test. Should be done carefully, step by step.
Yes, we should spend a lot of time on it.
> > What action should swish-e take when converting utf-8 on input and
> > there's a conversion failure?
>
> Conversion cannot fail, because we fully support utf-8 with search script.
> It will receive input charset as the parameter and convert (or not)
> accordingly. If the input charset is utf-8 - we use iconv() to convert it to
> 8-bit; if input charset is 8-bit - we don't convert chars at all. I hope I
> didn't miss any detail.
I don't follow. You can't convert utf-8 to "8-bit"; you have to convert
to a specific encoding like 8859-1 or Windows-1250. Those are 8-bit
encodings, but obviously you can't convert every utf-8 char to them.
If the index contains words encoded in the 8859-1 character set (or
Windows-1250) and someone submits a query in utf-8 with characters that
don't map to 8859-1, that's a conversion failure.
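A minimal sketch of that failure case, in Python rather than C with iconv. The function name convert_query and its index_charset parameter are hypothetical, and returning None is just one possible answer to the question of what action to take on failure.

```python
from typing import Optional


def convert_query(query_utf8: bytes, index_charset: str = "iso-8859-1") -> Optional[bytes]:
    """Re-encode a utf-8 query into the index's charset.

    Returns None when the query is not valid utf-8 or contains a character
    that has no mapping in index_charset (the conversion failure).
    """
    try:
        return query_utf8.decode("utf-8").encode(index_charset)
    except UnicodeError:  # covers both decode and encode errors
        return None


# U+00E9 (e with acute) maps to 8859-1, so this query converts.
ok = convert_query("caf\u00e9".encode("utf-8"))

# U+0161 (s with caron) has no 8859-1 mapping, so this one fails...
failed = convert_query("\u0161koda".encode("utf-8"))

# ...but the same query converts if the index is Windows-1250.
ok_1250 = convert_query("\u0161koda".encode("utf-8"), "cp1250")

print(ok, failed, ok_1250)
```

Whatever the caller does with the None (reject the query, drop the character, substitute something) is the policy decision being asked about above.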
--
Bill Moseley
moseley@hank.org
Received on Thu Dec 11 18:18:26 2003