Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Thu Dec 11 2003 - 15:09:12 GMT
> > I don't see why we could not convert everything back to original 8-bit
> > using other function instead of UTF8Toisolat1()? It seems that we even do
> > not need to know what was the original charset.
>
> Correct.  That's why I was suggesting you could use iconv() in parser.c.
> Might not be that much of a hack to replace the latin1 conversion with
> your windows-1250 conversion.
>
> I have thought about using iconv() in the past but it would require
> other changes to the code to support it in a general way (updates to the
> config process, index header format and the parser) and haven't had the
> time, and also have thought that's a work-around instead of a real fix
> using utf-8 internally would be.  It might be easier to do a complete
> rewrite than to convert to utf-8, though.


The first thing we should do is provide full 8-bit support, not 7-bit as it
is now.

If a single code tweak (iconv() instead of UTF8Toisolat1()) gives us full
8-bit support, it should be done immediately.
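
To make the idea concrete, here is a rough sketch of what such a replacement
could look like. The helper name utf8_to_8bit() and its signature are made up
for illustration and are not from parser.c; only iconv_open(), iconv() and
iconv_close() are real calls. The target charset name (for example
"WINDOWS-1250") is passed in, and which names are accepted depends on the
iconv implementation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from swish-e): convert a UTF-8 buffer to an 8-bit
 * charset chosen at run time.
 * Returns the number of bytes written to out, or (size_t)-1 on failure. */
size_t utf8_to_8bit(const char *charset, const char *in, size_t in_len,
                    char *out, size_t out_size)
{
    /* iconv_open() takes (tocode, fromcode), in that order. */
    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* charset name not known to iconv */

    char *src = (char *)in;         /* iconv() wants char **; input is only read */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;          /* invalid input, no equivalent, or out too small */

    return out_size - dst_left;     /* bytes produced */
}
```

The caller never names the document's original charset as a special case; it
only passes the charset it wants back.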

Full utf-8 support is no joke and certainly requires a lot more work and
testing. It should be done carefully, step by step.


> > Regarding tolower(), it should behave the same way - first we convert
> > everything to utf-8, then do the tolower_utf8() and then convert
> > everything back to 8-bit.
>
> Where's tolower_utf8() defined?  Doing the tolower on the utf-8 is
> possible -- but it's not trivial because of the way the input buffer is
> managed, and currently the input text buffer is shared between
> properties and text for indexing -- so those buffers would need to be
> split (don't want to tolower() the properties).


tolower_utf8() is a name I made up; I guess somebody has already written such
a function somewhere. I was thinking of simply replacing the existing
tolower() call with tolower_utf8(), nothing else. I am not sure whether that
is possible.

If that is not possible, let's add settings to the conf file where the user
can specify 8-bit upper and lower case pairs, and solve the problem without
involving utf-8 functions. The small complication is that different charsets
need different pairs, but that should not be hard to implement.
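
A minimal sketch of the pairs idea, assuming one 256-entry table per charset.
The type and function names (case_table, case_table_add_pair and so on) are
invented for illustration; how the pairs would be written in the conf file is
left open here.

```c
#include <stddef.h>

/* Hypothetical structure: one lowercase lookup table per 8-bit charset. */
typedef struct {
    unsigned char lower[256];
} case_table;

/* Start from the identity mapping plus plain ASCII A-Z -> a-z. */
void case_table_init(case_table *t)
{
    for (int i = 0; i < 256; i++)
        t->lower[i] = (unsigned char)i;
    for (int c = 'A'; c <= 'Z'; c++)
        t->lower[c] = (unsigned char)(c - 'A' + 'a');
}

/* Register one upper/lower pair, as it might be read from the conf file. */
void case_table_add_pair(case_table *t, unsigned char upper,
                         unsigned char lower)
{
    t->lower[upper] = lower;
}

/* Lowercase a buffer in place using the table. */
void case_table_fold(const case_table *t, unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = t->lower[buf[i]];
}
```

For windows-1250, for example, the pair 0x8A/0x9A (S with caron) would be
registered with case_table_add_pair(&t, 0x8A, 0x9A). Bytes without a
registered pair pass through unchanged.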

But these pairs already exist somewhere; we are not the first to need them.

Maybe we can use setlocale() on the fly, according to the current document's
charset?
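
Roughly what I have in mind, as a sketch only. lower_with_locale() is a
made-up name; setlocale() and tolower() are the standard C calls. Locale names
are system-dependent (something like "cs_CZ.CP1250" is only a guess at what a
given system might call it), and setlocale() changes the locale for the whole
process, so the sketch restores the previous setting afterwards.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: lowercase an 8-bit buffer in place under the given
 * LC_CTYPE locale, then restore the previous locale.
 * Returns 0 on success, -1 if the locale is unavailable or memory runs out. */
int lower_with_locale(const char *locale_name, unsigned char *buf, size_t len)
{
    /* setlocale(cat, NULL) only queries; copy the result because a later
     * setlocale() call may overwrite the returned string. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);                /* locale not installed; nothing changed */
        return -1;
    }

    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)tolower(buf[i]);

    setlocale(LC_CTYPE, saved);     /* put the old locale back */
    free(saved);
    return 0;
}
```

If the locale for a document's charset is not installed, the function returns
-1 and leaves the text untouched, so the caller would still need a fallback
such as user-supplied pairs.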

After we implement this, we can say swish-e is fully 8-bit and think about
the next major step - utf-8 support. I have tested all the other open source
engines and this one is the best. Why not make it even better? :)


> > Of course, search script has to know what is the input charset so it can
> > properly translate the input to utf8. Checkout the parameters when
> > searching using Google - it does the same. This way we can even introduce
> > full utf-8 support at least for the search script.
>
> What action should swish-e take when converting utf-8 on input and
> there's a conversion failure?

Conversion cannot fail, because the search script fully supports utf-8. It
will receive the input charset as a parameter and convert (or not)
accordingly. If the input charset is utf-8, we use iconv() to convert it to
8-bit; if the input charset is 8-bit, we don't convert the chars at all. I
hope I didn't miss any detail.
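
The decision described above could be sketched like this. prepare_query() and
its parameters are hypothetical names, not existing swish-e code. The sketch
assumes the 8-bit input already matches the index charset. It also passes
iconv()'s error return back to the caller, since iconv() can report a failure,
for instance when a utf-8 character has no equivalent in the 8-bit charset.

```c
#include <iconv.h>
#include <string.h>
#include <strings.h>    /* strcasecmp() (POSIX) */

/* Hypothetical helper for the search script: bring the query into the 8-bit
 * charset of the index.
 * Returns the number of bytes written to out, or (size_t)-1 on failure. */
size_t prepare_query(const char *input_charset, const char *index_charset,
                     const char *query, size_t query_len,
                     char *out, size_t out_size)
{
    /* 8-bit input: assumed to match the index charset, so copy it as-is. */
    if (strcasecmp(input_charset, "UTF-8") != 0) {
        if (query_len > out_size)
            return (size_t)-1;
        memcpy(out, query, query_len);
        return query_len;
    }

    /* utf-8 input: convert down to the 8-bit index charset. */
    iconv_t cd = iconv_open(index_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)query;      /* iconv() wants char **; input is only read */
    char *dst = out;
    size_t src_left = query_len;
    size_t dst_left = out_size;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : out_size - dst_left;
}
```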
Received on Thu Dec 11 15:09:22 2003