Skip to main content.
home | support | download

Back to List Archive

Re: Swish-e with incremental mode still crashes sometimes

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 30 2005 - 14:13:43 GMT
On Fri, Sep 30, 2005 at 04:21:57AM -0700, Uwe Dierolf wrote:
> we still have problems with Swish-e compiled with incremental mode.
> It still crashes sometimes while updating an index.
> We use the incremental mode to delete from and insert single records
> into our index with about 1 million XML records.
> If the crash occurs the index seems to work but it's damaged.
> We only recognized that it must be defect cause of strange search results.
> Perhaps you have an idea how to find this ugly bug.
> We regret that we can not give you test data to reproduce a crash.

I'm not sure how closely Jose watches this list, so you might try
emailing him directly.  Is it possible to get a back trace of the
crash?

> Swish-e spends most of the time sorting properties. We profiled the code to 
> identify the functions:
> 
> 82 % of the time in  compFileProps (pre_sort.c)
> 13 % of the time in  CreatePropSortArray (pre_sort.c)

Nothing odd about that.  All CreatePropSortArray does is create a big
array and populate it from disk, and then calls qsort, which uses
compFileProps.

One option would be to turn off presorting on all or most properties,
depending on how you sort your results.  Swish sorts all properties by
default.

Is this a possibility?  Do you only generate results by rank?  Or are
your result sets very small?

> 
> Would it be feasable to:
> 
> 1) Use another sorting algorithm (not quicksort) in case of incremental update.
> For example insertion sort or shell sort perform much better on almost sorted
> data (inserting few records in a already sorted list). See for example:
> 
>     http://www.softpanorama.org/Algorithms/sorting.shtml

Hard to imagine a shell sort would be faster.  All the time should be
is likely spent in reading the property file.

Searching is where you want the speed, and most people want fast
indexing at the expense of indexing speed.

The final result of the sort is an integer array total docs long with
the docs numbered sequentially.  This makes it very fast to determine
a docs sort order compared to other documents -- it's just a lookup in
an index array.

The problem with this, though, is the array does not contain any
information about the original data that was sorted.  So to insert you
have to go back to the original data.  Currently, swish just recreates
the index from scratch.

One possible solution to avoid reading all the properties again would
be to use a binary sort.  This is what is done when using -L to limit
results in to a range of properties.

With -L swish takes the pre-sorted property index, sorts it by the
sort value, then does a binary search to find the location of the
property.  With incremental it could then insert the new and re-number
all records with a larger sort value, and then finally resort the
entire array back into document record number.

That's still two sort operations, plus the binary search (and reading
properties off disk).  It would likely be much faster, though.

> 2) Take the discrimination between strcoll, ignore_case, use_case (see
> Compare_Properties [docprop.c] which is called from compFileProps) out of the
> loop which implicit in swish_qsort.
> 
> That is instead of:
> 
>     // sorting one property
>     loop
>        if (is_meta_ignore_case) then strncasecmp (val1, val2)
>        if (is_meta_use_strcoll) then strcoll     (val1, val2)
>     endloop
> 
> use:
> 
>     // sorting one property
>     if (is_meta_ignore_case) then cmp_function = strncasecmp
>     if (is_meta_use_strcoll) then cmp_function = strcoll
>     loop
>         call cmp_function (val1, val2)
>     endloop

I would agree that any extra work in the compare function should be
limited, because that function is called so many times.  But, the test
on the metaType is a memory access and a bit-wise and.  Those are two
things the CPU can do very quickly.  Now, if the memory is paged out,
well...

Feel free to try removing those "if" tests.  IIRC, once upon a time
swish did set the function like your example above to avoid the checks
in the compare function.  Try setting up a config with just one
property that needs sorting (so you know its string type) and then
rewrite the compare function to do just that compare and see how much
faster it is.

If it is much faster then it won't be much work to create separate
functions for each type of property and pass that function to qsort.

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Fri Sep 30 07:13:44 2005