Re: Problems with sorting German Umlaut

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 08 2005 - 15:46:37 GMT
On Tue, Feb 08, 2005 at 11:50:08AM +0100, Andreas Seltenreich wrote:
> > sizeof(propEntry) is returning more bytes than needed, so there's
> > actually room at the end.  So this should work:
> >
> >     docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
> >     memcpy(docProp->propValue, propValue, propLen);
> >     docProp->propValue[propLen] = '\0';
> >     docProp->propLen = propLen;
> 
> So the same trick in append_property() and we're done? That sounds
> almost too easy :-)

I don't think it's needed in append_property().  append_property() is
only used during indexing.  AFAIK, the compare_properties() function is
only called after indexing is done, and append_property() is not used
then.

> It doesn't seem to be standardised at all. So I guess a fallback with
> autoconf to copying with bin2string() would be mandatory.

Or just call strcoll() on the docprop->propValue, since it's null
terminated now.
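
A minimal sketch of that idea, not the actual swish-e code: prop_sketch,
make_prop() and compare_props_coll() are hypothetical names, and plain
malloc() stands in for emalloc().  Unlike the snippet quoted above, it
uses a C99 flexible array member and asks for the extra byte explicitly
rather than relying on padding in sizeof().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for swish-e's propEntry; the real struct differs. */
typedef struct {
    unsigned int  propLen;      /* length of the value, without the '\0' */
    unsigned char propValue[];  /* value bytes plus a trailing '\0' */
} prop_sketch;

/* Copy len bytes of value into a new property and null terminate it.
 * Returns NULL if the allocation fails. */
static prop_sketch *make_prop(const char *value, size_t len)
{
    prop_sketch *p = malloc(sizeof(prop_sketch) + len + 1);
    if (p == NULL)
        return NULL;
    memcpy(p->propValue, value, len);
    p->propValue[len] = '\0';
    p->propLen = (unsigned int)len;
    return p;
}

/* Because the value is null terminated, strcoll() can be called on it
 * directly, with no temporary copy. */
static int compare_props_coll(const prop_sketch *a, const prop_sketch *b)
{
    return strcoll((const char *)a->propValue, (const char *)b->propValue);
}
```

In the default "C" locale strcoll() orders strings the same way strcmp()
does, so the locale still has to be set for this to change anything.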

> > I know you mentioned this before, but what does strcoll do with case?
> > I wonder what to do with the is_meta_ignore_case test there.
> 
> I hope I understood you correctly here. strcoll() is a bit orthogonal
> to strcasecmp(). Basically, it is a strcmp(), but with a
> locale-dependent order of the characters.

Sorry, I wasn't writing clearly.  What does someone do if they want
strcoll() but ignoring case?  Is it a matter of getting a locale that
sorts in that way?

If we have to use tolower() (or toupper()) to do an "ignorecase"
compare then we have to go back to making a copy in memory.  The question
is where to make that copy?

One option would be to add a new entry into the property so there's
both the normal prop string and also a tolower() version of the
string.  Then in Compare_Properties use p1->propValue_lower when
"ignorecase" is set.  That would be fastest, but most memory
intensive.

The other option is to make the copy and tolower() in
Compare_Properties using alloca if available, otherwise bin2string (or
strdup since the props should be null terminated now) to make a copy
and lowercase it.  Your tests were reasonably fast, but it just bugs
me to do too much work inside a function called by qsort.
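
Roughly what that second option looks like, as a sketch only.
lower_copy() and coll_ignore_case() are made-up names, and it uses
malloc() instead of alloca or bin2string to stay portable.  Note that
tolower() works one byte at a time, so this assumes a single-byte
character set.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated lowercase copy of s, or NULL on failure. */
static char *lower_copy(const char *s)
{
    size_t len = strlen(s);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        copy[i] = (char)tolower((unsigned char)s[i]);
    copy[len] = '\0';
    return copy;
}

/* Locale-aware compare that ignores case by lowercasing both sides
 * first.  If a copy cannot be allocated it falls back to a
 * case-sensitive strcoll() on the original strings. */
static int coll_ignore_case(const char *a, const char *b)
{
    char *la = lower_copy(a);
    char *lb = lower_copy(b);
    int result = (la != NULL && lb != NULL) ? strcoll(la, lb)
                                            : strcoll(a, b);
    free(la);   /* free(NULL) is a no-op */
    free(lb);
    return result;
}
```

Every call allocates and frees two strings, which is exactly the
per-comparison work inside qsort that I'd like to avoid.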

Maybe the middle ground would be to create the lowercase version
in Compare_Properties and then cache it in the property.  That would
avoid doing it more than once.
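
Something along these lines, as a sketch under assumptions: cached_prop,
get_lower() and compare_cached() are hypothetical names, and the real
property struct would need a new field for the cached copy.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical property with room for a cached lowercase copy. */
typedef struct {
    const char *value;        /* null-terminated property string */
    char       *value_lower;  /* built on first use, NULL until then */
} cached_prop;

/* Return the lowercase form, creating and caching it on first use.
 * Falls back to the original string if the allocation fails. */
static const char *get_lower(cached_prop *p)
{
    if (p->value_lower == NULL) {
        size_t len = strlen(p->value);
        char *copy = malloc(len + 1);
        if (copy == NULL)
            return p->value;
        for (size_t i = 0; i < len; i++)
            copy[i] = (char)tolower((unsigned char)p->value[i]);
        copy[len] = '\0';
        p->value_lower = copy;
    }
    return p->value_lower;
}

/* qsort() callback for an array of cached_prop pointers.  Each string
 * is lowercased at most once, however often it gets compared. */
static int compare_cached(const void *a, const void *b)
{
    cached_prop *pa = *(cached_prop * const *)a;
    cached_prop *pb = *(cached_prop * const *)b;
    return strcoll(get_lower(pa), get_lower(pb));
}
```

The cached copies stay around until something frees them, so the memory
cost is the same as the first option for any property that actually gets
sorted.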


> I don't know what the right way would be to deal with the difference.
> Maybe instead of a case:ignore flag one should introduce a
> collate:<insert locale here> flag, and adjust the locale appropriately
> on each document property. So the user would still be able to choose a
> per-property collating sequence.

Here's the way I implemented it.  configure checks for strcoll() and
if available swish is compiled with support for it.  Then you can set
the type of compare function:

    PropertyNamesCompareCase compare
    PropertyNamesIgnoreCase ignore
    PropertyNamesUseStrcoll strcoll



   swishlastmodified : id= 9 type=18  META_PROP:DATE
             compare : id=10 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100 
              ignore : id=11 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100 
             strcoll : id=12 type=262  META_PROP:STRING(case:strcoll) SortKeyLen: 100 

It's a bit confusing.  If you did:

   PropertyNamesUseStrcoll prop1 prop2
   presortedIndex prop1

then prop1 is sorted based on the LC_COLLATE setting at indexing time,
but prop2 is sorted based on the LC_COLLATE setting at run time.

Now that can break, because there are cases where swish has to go back
and look at the properties during run time, even when there's a
pre-sorted index available for that property.

So if LC_COLLATE changes between indexing and run time then there will
be odd sorting.

One solution would be to store the locale used at indexing time and
use that at run time.  That limits the ability to modify the sort
order at run time, though.  There's also the risk that a locale may be
available where indexing is done, but not where searching is done.
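
A sketch of what storing and restoring could look like, using only
standard setlocale() calls.  current_collate_name() and
restore_collate() are hypothetical helpers, and where the name would be
stored in the index is left open.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* At indexing time: return a malloc'd copy of the current LC_COLLATE
 * locale name so it can be written to the index.  NULL on failure.
 * The copy is needed because a later setlocale() call may overwrite
 * the string that setlocale() returned. */
static char *current_collate_name(void)
{
    const char *name = setlocale(LC_COLLATE, NULL);
    if (name == NULL)
        return NULL;
    char *copy = malloc(strlen(name) + 1);
    if (copy != NULL)
        strcpy(copy, name);
    return copy;
}

/* At run time: try to switch LC_COLLATE to the stored name.
 * Returns 1 on success, 0 if this machine cannot honor the request. */
static int restore_collate(const char *stored_name)
{
    return setlocale(LC_COLLATE, stored_name) != NULL;
}
```

A return of 0 from restore_collate() is the "locale not available where
searching is done" case, and the caller would have to decide whether to
warn, fall back, or refuse to sort.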

I'm not sure whether that should be per-property or per-index,
though.  And what if sorting by multiple properties?  Would need to
change locale inside a qsort function.  I assume there's overhead in
doing that.

Also need to think about how merge works.  What if there's conflicting
locales?


> >    setlocale(LC_ALL, "");
> 
> Theoretically yes, but I don't know the code well enough to decide
> that. Setting LC_ALL will switch on locale-awareness for a lot of
> other functions. For example it'll also set LC_NUMERIC, which changes,
> for example, the output/input format of the printf()/scanf() functions
> dependent on the selected locale. I could imagine this could break
> some parts of the code, or users' code that depends on
> machine-readable output from swish-e.

I can't think of any other place in swish where locales are used.  The
commas in indexing output numbers are hard-coded.  So that might be
one place LC_NUMERIC could be used.
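
If the worry is LC_NUMERIC changing printf()/scanf() behavior, one
cautious approach is to set only the collation category.  A minimal
sketch, with enable_collation_only() and format_number() as made-up
helper names:

```c
#include <locale.h>
#include <stdio.h>

/* Take only the collation order from the environment.  LC_NUMERIC and
 * the other categories stay in the default "C" locale.
 * Returns 1 if the environment's locale was accepted, 0 otherwise. */
static int enable_collation_only(void)
{
    return setlocale(LC_COLLATE, "") != NULL;
}

/* Format a number with two decimals.  Since LC_NUMERIC is untouched,
 * the decimal separator stays '.' whatever the user's locale is. */
static void format_number(char *buf, size_t size, double x)
{
    snprintf(buf, size, "%.2f", x);
}
```

That keeps machine-readable output stable while strcoll() still follows
the user's LC_COLLATE (or LC_ALL / LANG) setting.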

-- 
Bill Moseley
moseley@hank.org

Received on Tue Feb 8 07:46:37 2005