Re: Skipping articles while sorting

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jan 02 2004 - 22:54:44 GMT
On Fri, Jan 02, 2004 at 05:32:55PM -0500, Thomas Dowling wrote:
> FWIW, this can be a frustratingly difficult job to tackle.  There will 
> inevitably be documents with titles like "A & P hiring cashiers, 
> baggers" or the famous "THE Journal" (Texas Higher Education, if memory 
> serves).  You also need to think now about multilingual support, which 
> opens cans of worms like interpreting if "Die" is a German article or an 
> English word.

I agree.  It's been requested a few times, and I've been hesitant to
implement it for those reasons.

> This is why libraries gave up on the job years ago and rely on catalog 
> records to tell any sorting routine how many characters to skip over.  :-\

That would be better, but we don't really have access to that other
metadata while sorting.  Plus, it would mean that documents would need
to have the offset stored in them somehow.

> I'd recommend doing any article trimming at sort time rather than index 
> time, with a way in the user interface to turn it off (or on, depending 
> on what the default is set to)

I'll look into that.
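
For illustration only, here is a minimal Python sketch of article
trimming applied at sort time with a switch to turn it off.  It is not
swish-e code: the names LEADING_ARTICLES, sort_key and sort_titles are
hypothetical, and the article list is an English-only assumption.

```python
import re

# Hypothetical, English-only article list (an assumption for this sketch).
LEADING_ARTICLES = ("a", "an", "the")

# Match one leading article followed by whitespace, ignoring case.
_ARTICLE_RE = re.compile(r"^(?:%s)\s+" % "|".join(LEADING_ARTICLES),
                         re.IGNORECASE)


def sort_key(title, skip_articles=True):
    """Return the string to sort by; the stored title is left untouched."""
    key = title.strip()
    if skip_articles:
        stripped = _ARTICLE_RE.sub("", key)
        if stripped:  # never reduce a title to an empty key
            key = stripped
    return key.casefold()


def sort_titles(titles, skip_articles=True):
    """Sort titles, optionally ignoring a leading article."""
    return sorted(titles, key=lambda t: sort_key(t, skip_articles))


if __name__ == "__main__":
    titles = ["The Apple", "Banana", "A Cherry"]
    print(sort_titles(titles))                       # articles skipped
    print(sort_titles(titles, skip_articles=False))  # plain sort
    # The naive rule also trims titles where "A" is not an article:
    print(sort_key("A & P hiring cashiers"))
```

The last call shows the problem Thomas describes: a simple pattern
cannot tell "A & P" from a title that really starts with an article,
and it knows nothing about other languages.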

Sorting in swish-e is really done at indexing time.  At indexing time
the properties are sorted, and an integer table of the resulting sort
order is stored.  At search time that table is used for sorting, not
the actual strings.
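
A rough Python sketch of that idea, not the actual swish-e code (the
function names and the list-based table layout are assumptions made
for the example):

```python
def build_rank_table(properties):
    """Index time: map each document number to its rank in sorted order.

    properties[doc_id] is the property string for that document.
    Documents with equal strings simply get consecutive ranks here.
    """
    order = sorted(range(len(properties)), key=lambda d: properties[d])
    ranks = [0] * len(properties)
    for rank, doc_id in enumerate(order):
        ranks[doc_id] = rank
    return ranks


def sort_results(doc_ids, ranks):
    """Search time: order a result set by stored integer rank only."""
    return sorted(doc_ids, key=lambda d: ranks[d])


if __name__ == "__main__":
    titles = ["pear", "apple", "zebra", "mango"]  # indexed by doc number
    ranks = build_rank_table(titles)
    print(ranks)
    print(sort_results([2, 0, 3], ranks))  # strings are never consulted
```

The catch for article skipping is that the ranks are fixed when the
index is built, so a sort-time switch would need either a second table
or a fallback to sorting the strings themselves.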

I need to benchmark this.  I'll bet that for normal result sets the
pre-sorted tables don't help that much with speed.  We use qsort,
which needs the sort keys in memory, so using integers rather than the
actual property strings does save memory.

You are correct -- it's a can of worms.


-- 
Bill Moseley
moseley@hank.org
Received on Fri Jan 2 22:54:53 2004