
Re: Best way to index large datasets

From: <moseley(at)not-real.hank.org>
Date: Mon Sep 08 2003 - 14:44:07 GMT
On Sun, Sep 07, 2003 at 09:00:33PM -0700, Mark Fletcher wrote:
> Hello,
> 
> We use Swish-e for our service, Bloglines, and have been very happy with 
> it. Currently, it's indexing about a gigabyte of data, consisting of 
> about 2 million html "pages," but that's rapidly expanding. Right now, a 
> full index is taking around 3 hours on our hardware. It's not memory 
> constrained. What can I do going forward to deal with the increasing 
> data? An immediate desire is to take advantage of more than one 
> processor, so I was thinking of just splitting the data over multiple 
> index files. Is there any advantage to merging indexes into one big file 
> instead of just using several smaller files? Is there a performance hit 
> on searching multiple files? Anything else I should be considering?

I assume you are using -e when indexing (otherwise you would have huge 
memory usage and very slow indexing).
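
For example, an indexing run might look like this (swish.conf and 
index.swish-e are just placeholder names):

   swish-e -c swish.conf -e -f index.swish-e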

Searching multiple indexes is slower mostly because of the sorting -- 
especially if sorting by properties defined in your source files (like 
title or path).

At the end of indexing swish goes through each property and generates an 
integer table based on the sort order of that property.  When sorting 
during a search it uses this table to quickly sort the results.  This 
avoids reading the property off disk for each result and comparing it, 
and it's often faster to compare integers than strings.  It's a nice 
optimization, but it costs memory -- for your 2 million files it loads 
an 8MB table into memory for each sort property (4 bytes for each of 
the 2 million files).
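
That table is what a sorted search against a single index can use, e.g. 
(placeholder index name and query):

   swish-e -w foo -s swishtitle -f index.swish-e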

Now, if you are searching two indexes the same process happens, but for
each index.  But then there is an extra step when printing out the
results: a "tape-merge" must be done on the two indexes based on their
real properties.  Since those "pre-sorted" integer tables are specific 
to a given index they cannot be compared between different indexes.  
So, the properties must be read off disk when generating results.  It's 
not too bad when printing the first page of results (say -b 1 -m 15), 
but if you have 500,000 results in your set and someone wants to see the 
last 15 results then at least 499,985 properties must be read off disk.
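
In other words, a deep page like this against two indexes is the 
expensive case (file names and offsets are only placeholders):

   swish-e -w foo -s swishtitle -f index1.swish-e index2.swish-e -b 499986 -m 15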

Merging might be better.  It avoids the process of re-parsing all the 
documents.  But it depends on how your collection is changing.
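
I think the merge switch is -M, something like this (placeholder file 
names -- check swish-e -h to be sure):

   swish-e -M index1.swish-e index2.swish-e merged.swish-e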

If you are just adding files to your collection you might want to try 
using the configure option:

   --enable-incremental

Then there's a -u switch which should just add records to the index when 
indexing.  It's *experimental* and not many people have tested it, I 
think.  It's not really incremental because you can only add files, not 
replace or remove existing ones.  The index files are also not compatible 
with swish-e 
built without --enable-incremental.
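
Roughly, the steps would be (placeholder paths and file names):

   ./configure --enable-incremental && make && make install

   swish-e -c swish.conf -e -f index.swish-e      (initial full index)
   swish-e -u -c swish.conf -f index.swish-e      (later runs, add new files)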

Finally, you might look at config.h and swish.h and try messing with
hash table sizes and other parameters that define some working sizes and 
limits.  Jose could probably give better directions there.

In swish.h there are things like:

#define HASHSIZE 1009
#define BIGHASHSIZE 10001
#define VERYBIGHASHSIZE 100003

I'm not sure if changing those will have an effect.  I have noticed when 
indexing without -e that indexing slows over time -- and that makes 
sense as the hash tables fill up.
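
If you want to experiment you'd just bump those values and rebuild -- 
these numbers are arbitrary and untested:

#define HASHSIZE 10007
#define BIGHASHSIZE 100003
#define VERYBIGHASHSIZE 1000003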




-- 
Bill Moseley
moseley@hank.org
Received on Mon Sep 8 14:45:39 2003