RE: RE: swish-e dumping core when indexing large file sets

From: Scott Schultz <scott(at)not-real.ceweekly.com>
Date: Fri Jun 25 1999 - 16:26:24 GMT
>From: cstephenson@ccmail.uwsa.edu [mailto:cstephenson@ccmail.uwsa.edu]
     
>     The solution has been to index smaller chunks of files and then to 
>     merge the indexes together.
     
>     When merging large index files, swish will dump core also.
     
>     Will upgrading to the 1.3.2 version of Swish-e remedy this?  If not, 
>     is there another solution?
     
There IS another solution if you find that merging is too expensive
in terms of time or memory. I'm not sure whether it's documented
anywhere (I haven't looked at the man page in a while), but Swish-E
can search more than one index at a time. Instead of searching
one huge index, you can give the "-f" flag a list of index files to 
search. I haven't done any investigation into what the upper limits 
of the index list might be.
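
As a rough sketch of what that looks like from a script, the Python below builds a single swish-e command line that names several indices after "-f". The "-w" flag for the search words is how I recall the Swish-E command line, so check it against your version; the index file names and the build_command helper are made up for illustration.

```python
from typing import List


def build_command(words: List[str], index_files: List[str]) -> List[str]:
    """Build one swish-e invocation that searches several indices.

    Assumption: "-w" introduces the search words and "-f" takes a
    space-separated list of index files. Verify both against the
    documentation for your Swish-E version.
    """
    if not index_files:
        raise ValueError("at least one index file is required")
    return ["swish-e", "-w", *words, "-f", *index_files]


# Hypothetical index names: one index per chunk of the document tree.
cmd = build_command(["core", "dump"], ["a.index", "b.index", "c.index"])
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run() as is; results come back grouped in the order the indices were listed.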

There's a downside to doing this. By searching a bunch of individual
indices, you get the results back in a semi-sorted way; that is, if you 
use three indices (a, b, and c) then your rankings will look like this:

a1,a2,a3,b1,b2,b3,c1,c2,c3 (where 1 is the best match and 3 is the worst)

but you probably want them to look like this:

a1,b1,c1,a2,b2,c2,a3,b3,c3 (or whatever the best sorting would be)

If you don't care about the ranking order of the results, or you find
the expense of sorting them in your scripting engine to be acceptable,
then this can be a viable alternative to creating a giant index for a 
lot of files. 
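
If you do want to re-sort in your scripting engine, a minimal Python sketch follows. It assumes each result line starts with a numeric rank followed by the rest of the record (path, title, size), and that header lines and the closing "." do not start with a number; that is my recollection of the output format, not something to rely on without checking. The Hit type and both helper names are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    rank: int   # Swish-E rank; higher is assumed to be a better match
    rest: str   # remainder of the result line (path, title, size)


def parse_hits(output: str) -> List[Hit]:
    """Keep only lines that begin with a numeric rank (assumed format)."""
    hits = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Skips "# ..." headers, "search words: ..." and the final "."
        if len(parts) == 2 and parts[0].isdigit():
            hits.append(Hit(int(parts[0]), parts[1]))
    return hits


def rerank(hits: List[Hit]) -> List[Hit]:
    """Sort all hits by rank, best first. The sort is stable, so hits
    with equal rank keep the order in which the indices were listed."""
    return sorted(hits, key=lambda h: h.rank, reverse=True)


# Made-up output from a search across two indices.
sample = """# SWISH format 1.3
1000 /a/1.html "A1" 100
500 /a/2.html "A2" 100
900 /b/1.html "B1" 100
400 /b/2.html "B2" 100
.
"""
for hit in rerank(parse_hits(sample)):
    print(hit.rank, hit.rest)
```

Keep in mind that ranks computed against separate indices may not be strictly comparable, so treat the merged order as approximate.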

This can work to your advantage as well. It allows you to group documents
based on some criteria and only index those documents when necessary, 
instead of indexing the entire tree. It's also useful if you deliberately
want to influence the order of the results. At the site I administer, 
we have multiple classes of search articles. I wanted one class of articles
to appear before the other class, no matter what the internal Swish-E 
ranking. By creating two indices, the documents from class A are all 
returned before the documents in class B.

The effect is the same as if you had called swish-e multiple times on
different indices but it saves you the system overhead of launching a
new process every time. It's not the best solution for everyone but 
it works well for certain kinds of applications.

Scott Schultz
scott@ceweekly.com 
Received on Fri Jun 25 09:26:31 1999