Re: Problems with sorting German Umlaut

From: Bill Moseley <moseley(at)>
Date: Tue Feb 08 2005 - 01:20:38 GMT
On Sat, Feb 05, 2005 at 02:03:31AM +0100, Andreas Seltenreich wrote:
> I hope I read the source correctly, in that all propstrings have to
> pass through addDocProperty() on their way into the index.

Yes, that's true.  It takes a string or number and turns it into a
"property" that's attached to a file.  It can be called more than once
while indexing a given file for the same property name (in which
case the string is appended).

> Grepping for it, and looking at what is being fed to it revealed,
> that the propstrings are all \0-Terminated.

The strings passed in are \0 terminated.  The properties
(propEntry->propValue array) are not null-terminated.  If you are
seeing a \0 at the end of the propValue it's just by chance.  Only the
length of the string is copied in memory.

Also, properties can be appended, so you couldn't put the null on the
end until parsing of a given document is complete.

Any sorting is done after parsing, of course, so the null could be
safely added, I would guess.

If you look at CreateProperty() you can see where the docProp is
allocated:

    docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
    memcpy(docProp->propValue, propValue, propLen);
    docProp->propLen = propLen;

sizeof(propEntry) already returns more bytes than needed, so there's
actually room at the end for a terminator.  So this should work:

    docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
    memcpy(docProp->propValue, propValue, propLen);
    docProp->propValue[propLen] = '\0';
    docProp->propLen = propLen;

without any memory cost.  (malloc is probably over-allocating anyway,
so fine-tuning that malloc likely won't save any memory.)

Does that make sense to you?  That would avoid the need to copy the
strings in the qsort compare function.
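
To make the "room at the end" argument concrete, here is a minimal
standalone sketch.  The struct is a hypothetical stand-in for propEntry
(I'm assuming a length field followed by a one-byte trailing array; the
real definition may differ), and malloc() stands in for emalloc():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for swish-e's propEntry; the real struct may differ. */
typedef struct {
    unsigned int  propLen;
    unsigned char propValue[1];   /* this one byte is already counted by sizeof */
} propEntry;

/* Copy propLen bytes and terminate them.  Because sizeof(propEntry)
   already includes propValue[0], index propLen is inside the allocation. */
static propEntry *create_property(const char *propValue, size_t propLen)
{
    propEntry *docProp;

    assert(propValue != NULL);

    docProp = malloc(sizeof(propEntry) + propLen);
    if (docProp == NULL)
        return NULL;

    memcpy(docProp->propValue, propValue, propLen);
    docProp->propValue[propLen] = '\0';
    docProp->propLen = (unsigned int)propLen;
    return docProp;
}
```

With the terminator in place, a compare function could hand
(char *)docProp->propValue straight to strcoll() without copying.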

> Out of curiosity I also tried the less invasive approach of
> terminating the non-terminated propstrings inside Compare_Properties()
> using bin2string(), which terminates the propstrings after copying
> them to dynamically allocated memory. The result did surprise me, as
> it was less than 3% slower than the former version.
> Seeing that the penalty of dynamic allocation inside
> Compare_Properties() using alloca() is smaller penalty than the one of
> the switch from strncasecmp()->strcoll(), I am tempted to suggest
> using the latter version, as it is more robust, less invasive and
> easily left in parallel with the old strncasecmp()/strcmp() code.

I'd agree, but how portable is alloca()?  I suppose we could just test
for it in configure when using --enable-strcoll.  But if the propValue
is \0 terminated we don't need to bother.
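
If alloca() turns out to be a portability problem, one possible
alternative is a fixed stack buffer with a malloc() fallback.  This is
only a sketch: the function names and the 256-byte threshold are made up
for illustration and are not from the swish-e source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STACK_BUF 256   /* arbitrary threshold chosen for this sketch */

/* Return a terminated copy of the first len bytes of src.  Uses the
   caller's stack buffer when the copy fits, otherwise malloc(). */
static char *terminated_copy(const unsigned char *src, size_t len, char *stackbuf)
{
    char *dst = (len < STACK_BUF) ? stackbuf : malloc(len + 1);

    if (dst != NULL) {
        memcpy(dst, src, len);
        dst[len] = '\0';
    }
    return dst;
}

/* Compare two unterminated byte ranges with strcoll(), without alloca(). */
static int coll_compare(const unsigned char *a, size_t alen,
                        const unsigned char *b, size_t blen)
{
    char bufa[STACK_BUF], bufb[STACK_BUF];
    char *sa, *sb;
    int rc;

    assert(a != NULL && b != NULL);

    sa = terminated_copy(a, alen, bufa);
    sb = terminated_copy(b, blen, bufb);

    if (sa != NULL && sb != NULL) {
        rc = strcoll(sa, sb);
    } else {
        /* Out of memory: fall back to a plain byte comparison. */
        size_t n = alen < blen ? alen : blen;
        rc = memcmp(a, b, n);
        if (rc == 0)
            rc = (alen > blen) - (alen < blen);
    }

    /* Only free what came from malloc(); free(NULL) is harmless. */
    if (sa != bufa) free(sa);
    if (sb != bufb) free(sb);
    return rc;
}
```

Short properties never touch the heap this way, and long ones cost one
malloc() per side.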

> + #ifndef USE_STRCOLL
> + 
>           rc = is_meta_ignore_case( meta_entry)
>                ? strncasecmp( (char *)p1->propValue, (char *)p2->propValue, len )
>                : strncmp( (char *)p1->propValue, (char *)p2->propValue, len );
> + 
> + #else
> + 	/* strcoll() takes locale dependent collation into account and
> + 	   works with unicode. Sadly, there's no strNcoll(). */
> +         char *str1 = (char *)alloca(len + 1);
> +         char *str2 = (char *)alloca(len + 1);
> + 
> +         memcpy(str1, p1->propValue, len);
> +         str1[len] = '\0';
> +         memcpy(str2, p2->propValue, len);
> +         str2[len] = '\0';
> + 
> +  	rc = strcoll(str1, str2);

BTW -- the text is all 8859-1 at this point.  parser.c could be
modified to use separate buffers for text to index and for text to
store as properties and keep the UTF-8 encoding.

I know you mentioned this before, but what does strcoll do with case?
I wonder what to do with the is_meta_ignore_case test there.
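
As far as I know, how strcoll() treats case depends on the locale's
collation rules (in the "C" locale it behaves like strcmp()), so it
can't be relied on to ignore case.  One option for honoring
is_meta_ignore_case would be to fold case before collating.  A rough
sketch, assuming a single-byte encoding such as 8859-1 and terminated
strings; the function names and the ignore_case flag are hypothetical:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd lowercase copy of s (single-byte encodings only). */
static char *lower_copy(const char *s)
{
    size_t i, n = strlen(s);
    char *out = malloc(n + 1);

    if (out == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        out[i] = (char)tolower((unsigned char)s[i]);
    out[n] = '\0';
    return out;
}

/* Collate a and b; when ignore_case is set, fold both to lowercase first. */
static int coll_compare_case(const char *a, const char *b, int ignore_case)
{
    char *la, *lb;
    int rc;

    assert(a != NULL && b != NULL);

    if (!ignore_case)
        return strcoll(a, b);

    la = lower_copy(a);
    lb = lower_copy(b);
    /* If a copy could not be allocated, compare the originals instead. */
    rc = (la != NULL && lb != NULL) ? strcoll(la, lb) : strcoll(a, b);
    free(la);
    free(lb);
    return rc;
}
```

tolower() follows LC_CTYPE, so this only folds accented characters
correctly when the locale's character set matches the stored text.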

>       setlocale(LC_CTYPE, "");
> + #ifdef USE_STRCOLL
> +     setlocale(LC_COLLATE, "");
> + #endif

Can (should?) that just be:

   setlocale(LC_ALL, "");
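
One thing to weigh: LC_ALL also switches LC_NUMERIC, which changes the
decimal point used by printf() and strtod() in some locales.  A hedged
sketch of initializing everything and then pinning LC_NUMERIC back to
"C"; init_locale() and its flag are hypothetical names, not existing
swish-e code:

```c
#include <locale.h>
#include <stddef.h>

/* Initialize all locale categories from the environment.  If
   keep_c_numeric is non-zero, reset LC_NUMERIC to "C" afterwards so
   number parsing and formatting keep '.' as the decimal point.
   Returns 1 if the environment's locale was accepted, 0 otherwise. */
static int init_locale(int keep_c_numeric)
{
    /* On failure setlocale() returns NULL and leaves the locale unchanged. */
    int ok = setlocale(LC_ALL, "") != NULL;

    if (keep_c_numeric)
        setlocale(LC_NUMERIC, "C");

    return ok;
}
```

Called once at startup, this replaces the separate LC_CTYPE and
LC_COLLATE calls.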

Bill Moseley

Received on Mon Feb 7 17:20:41 2005