
Best way to index large datasets

From: Mark Fletcher <markf(at)>
Date: Mon Sep 08 2003 - 04:02:23 GMT

We use Swish-e for our service, Bloglines, and have been very happy with 
it. Currently, it's indexing about a gigabyte of data, consisting of 
about 2 million html "pages," but that's rapidly expanding. Right now, a 
full index is taking around 3 hours on our hardware. It's not memory 
constrained. What can I do going forward to deal with the increasing 
data? An immediate goal is to take advantage of more than one 
processor, so I was thinking of just splitting the data over multiple 
index files. Is there any advantage to merging the indexes into one big 
file instead of keeping several smaller ones? Is there a performance hit 
when searching across multiple files? Anything else I should be considering?
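The split I have in mind would look roughly like this (a sketch, not something we run today: the paths, swish.conf, and the four-way shard count are made up, and the swish-e flags -c, -f, -M, and -w are as I recall them from the 2.x docs, so check them against your version):

```shell
dir=$(mktemp -d)
# Stand-in document list; a real one would name the HTML pages on disk.
seq 1 100 | sed 's|.*|/data/page&.html|' > "$dir/doclist"
# Round-robin the list into four shards, one per CPU (GNU split).
split -n r/4 "$dir/doclist" "$dir/shard."
for s in "$dir"/shard.*; do
  # One independent index per shard; these commands could run in parallel.
  echo "swish-e -c swish.conf -f index.${s##*.}  # files listed in $s"
done
# The per-shard indexes could later be merged into one...
echo "swish-e -M index.aa index.ab index.ac index.ad merged.index"
# ...or searched together in a single query:
echo "swish-e -w query -f index.aa index.ab index.ac index.ad"
```

The echo lines just show the commands rather than running them, since the indexes don't exist here; the real question is whether the merged index or the multi-file search performs better.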


Mark Fletcher
Received on Mon Sep 8 04:03:55 2003