Re: Problems with sorting German Umlaut

From: Andreas Seltenreich <andreas.seltenreich(at)not-real.ubka.uni-karlsruhe.de>
Date: Tue Feb 08 2005 - 10:53:11 GMT
Bill Moseley writes:
> On Sat, Feb 05, 2005 at 02:03:31AM +0100, Andreas Seltenreich wrote:

> The strings passed in are \0 terminated.  The properties
> (propEntry->propValue array) are not null-terminated.  If you are
> seeing a \0 at the end of the propValue it's just by chance.  Only the
> length of the string is copied in memory.

Indeed :-/

> Also, properties can be appended, so you couldn't put the null on the
> end until parsing of a given document is complete.

Right.

> If you look at CreateProperty() you can see where the docProp is
> allocated:
>
>     docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
>     memcpy(docProp->propValue, propValue, propLen);
>     docProp->propLen = propLen;
>
> sizeof(propEntry) is returning more bytes than needed, so there's
> actually room at the end.  So this should work:
>
>     docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
>     memcpy(docProp->propValue, propValue, propLen);
>     docProp->propValue[propLen] = '\0';
>     docProp->propLen = propLen;

So the same trick in append_property() and we're done? That sounds
almost too easy :-)
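
To illustrate why the extra byte is there, here is a minimal sketch. It assumes propEntry ends in a one-element array (the old "struct hack"), so sizeof() already counts the first value byte plus any padding. The struct layout and the helper name make_prop() are hypothetical stand-ins, not the actual swish-e definitions, and malloc() stands in for emalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for swish-e's propEntry; the real layout may differ. */
typedef struct {
    unsigned int propLen;
    unsigned char propValue[1];   /* first byte of a variable-length value */
} prop_entry;

/* Copy a counted buffer into a new entry and NUL-terminate it.
 * sizeof(prop_entry) already covers propValue[0] (plus any padding), so
 * sizeof(prop_entry) + len leaves at least one byte after the value. */
static prop_entry *make_prop(const unsigned char *value, unsigned int len)
{
    prop_entry *p;

    assert(value != NULL);
    p = malloc(sizeof(prop_entry) + len);   /* malloc() in place of emalloc() */
    if (p == NULL)
        return NULL;
    memcpy(p->propValue, value, len);
    p->propValue[len] = '\0';               /* still inside the allocation */
    p->propLen = len;                       /* length excludes the terminator */
    return p;
}
```

Note that this only holds for the one-element-array layout. With a C99 flexible array member (propValue[]) sizeof() would not count any value byte, and the allocation would need an explicit + 1.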

>> Seeing that the penalty of dynamic allocation inside
>> Compare_Properties() using alloca() is smaller than that of the
>> switch from strncasecmp() to strcoll(), I am tempted to suggest
>> using the latter version, as it is more robust, less invasive and
>> easily left in parallel with the old strncasecmp()/strcmp() code.
>
> I'd agree, but how portable is alloca()?  I suppose we could just test
> for it in configure when using --enable-strcoll.

alloca() doesn't seem to be standardised at all (it is in neither ISO C
nor POSIX). So I guess an autoconf-selected fallback that copies with
bin2string() would be mandatory.
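
A rough sketch of what such a fallback could look like. It assumes configure defines HAVE_ALLOCA_H when <alloca.h> is usable. The function name compare_counted() is made up for illustration, and plain malloc()/free() stands in for the bin2string() copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef HAVE_ALLOCA_H          /* assumed to be set by a configure check */
#include <alloca.h>
#endif

/* Compare two counted, non-terminated buffers with strcoll() by first
 * making NUL-terminated copies of them. */
static int compare_counted(const unsigned char *a, size_t alen,
                           const unsigned char *b, size_t blen)
{
    char *s1, *s2;
    int rc;

    assert(a != NULL && b != NULL);
#ifdef HAVE_ALLOCA_H
    s1 = alloca(alen + 1);    /* stack space, released on return */
    s2 = alloca(blen + 1);
#else
    s1 = malloc(alen + 1);    /* heap fallback */
    s2 = malloc(blen + 1);
    if (s1 == NULL || s2 == NULL) {
        free(s1);
        free(s2);
        return 0;             /* a comparator cannot report errors */
    }
#endif
    memcpy(s1, a, alen);
    s1[alen] = '\0';
    memcpy(s2, b, blen);
    s2[blen] = '\0';
    rc = strcoll(s1, s2);     /* uses the current LC_COLLATE setting */
#ifndef HAVE_ALLOCA_H
    free(s1);
    free(s2);
#endif
    return rc;
}
```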

> I know you mentioned this before, but what does strcoll do with case?
> I wonder what to do with the is_meta_ignore_case test there.

I hope I understood you correctly here. strcoll() is a bit orthogonal
to strcasecmp(). Basically, it is a strcmp(), but with a
locale-dependent order of the characters.

Using LC_COLLATE=C, the characters are collated the same as in ASCII:
ABC...Zabc...z, so strcoll() behaves exactly like strcmp(). After
switching the locale, the sequence differs from ASCII. With en_US
it'll look like this: "AaBbCc..Zz", so the order of the sorted
properties turns out similar to the one of strcasecmp(). Imagine
strcasecmp(s1, s2) as strcmp(tolower(s1), tolower(s2)), with tolower()
applied to each character of the strings.
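
For comparison, a small sketch that sorts the same words with all three functions. The comparator names and sort_words() are only illustrative and do not come from swish-e.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() is POSIX, not ISO C */

/* qsort() comparators over an array of const char * */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static int cmp_nocase(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

static int cmp_collate(const void *a, const void *b)
{
    /* order depends on the LC_COLLATE locale currently in effect */
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of strings in place with the given comparator. */
static void sort_words(const char **words, size_t n,
                       int (*cmp)(const void *, const void *))
{
    assert(words != NULL && cmp != NULL);
    qsort(words, n, sizeof *words, cmp);
}
```

Without a setlocale() call the program runs in the "C" locale, so cmp_collate orders the words like cmp_bytes. After setlocale(LC_COLLATE, "") the result follows whatever locale the environment selects, which is why no output is shown here.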

I don't know what the right way would be to deal with the difference.
Maybe instead of a case:ignore flag one should introduce a
collate:<insert locale here> flag, and adjust the locale appropriately
on each document property. So the user would still be able to choose a
per-property collating sequence.
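
One way such a per-property collating sequence might be implemented, as a sketch only: it assumes a platform with the POSIX.1-2008 newlocale()/strcoll_l() interfaces, which is not a given everywhere. compare_in_locale() is a hypothetical helper, and the locale name would come from the proposed collate: flag.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for newlocale() and strcoll_l() */
#endif
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Compare s1 and s2 using the collating sequence of the named locale,
 * without touching the process-wide locale. Falls back to strcmp()
 * when the named locale is not available. */
static int compare_in_locale(const char *locale_name,
                             const char *s1, const char *s2)
{
    locale_t loc;
    int rc;

    assert(locale_name != NULL && s1 != NULL && s2 != NULL);
    loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return strcmp(s1, s2);
    rc = strcoll_l(s1, s2, loc);
    freelocale(loc);
    return rc;
}
```

Creating a locale object for every comparison is wasteful. Real code would more likely create one per property up front and reuse it while sorting.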

>>       setlocale(LC_CTYPE, "");
>>   
>> + #ifdef USE_STRCOLL
>> +     setlocale(LC_COLLATE, "");
>> + #endif
>
> Can (should?) that just be:
>
>    setlocale(LC_ALL, "");

Theoretically yes, but I don't know the code well enough to decide
that. Setting LC_ALL will switch on locale-awareness for a lot of
other functions. For example, it'll also set LC_NUMERIC, which changes
the input/output format of the scanf()/printf() functions depending
on the selected locale. I could imagine this breaking some parts of
the code, or users' code that depends on machine-readable output
from swish-e.
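
A small sketch of the difference. The function names are made up for illustration. Only LC_CTYPE and LC_COLLATE are switched to the environment's locale, so LC_NUMERIC stays in the startup "C" locale and printf()-style output keeps '.' as the decimal point.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Enable the environment's locale only for character classification and
 * collation; every other category stays in the startup "C" locale. */
static void init_sorting_locale(void)
{
    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");
}

/* Format a number for machine-readable output. The decimal point
 * printed by snprintf() depends on the LC_NUMERIC setting. */
static int format_score(char *buf, size_t size, double value)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%.2f", value);
}
```

With setlocale(LC_ALL, "") instead, the same format_score() call could print a decimal comma under a locale such as de_DE, which is the kind of breakage I have in mind.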

regards,
Andreas
Received on Tue Feb 8 02:53:14 2005