
Index size (was: [BUG] swish-e 2.0.5 hangs on 200 Kb index)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Apr 12 2002 - 23:31:39 GMT
At 01:20 PM 04/12/02 -0700, Michael wrote:
> 
>I certainly hope that the increase in index size is not a feature of 
>2.1. I currently have some index files under 2.05 that are in excess 
>of 50 megabytes. Having that get 15x bigger would not be very good.

2.1 makes more use of hash tables, which bloats the initial size of the
index.  I don't remember if 2.0.5 has a separate hash table for wildcard
searches like swish does now.  Swish does track a bit more data on each
word indexed, but if anything I'd expect there to be better compression in
2.1.

But, hey, why discuss this?  Might as well test:

Here are two indexing runs I just did, both over the same 24,735 HTML
files (184,345,060 total bytes).

   Version   Max RAM   Index size   Indexing time
  ---------   ------   ----------   -------------
   2.0.5      370MB       77MB          8m39s 
   2.1-dev     85MB       58MB            58s

OK, let's grab a magazine...  A Barracuda ATA IV 9ms 80GB drive goes for
$179 USD.  At that price it costs about 13 cents to store my 58MB file.
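
The back-of-the-envelope figure can be checked in a few lines of Python.
This is only a sketch: the function name is made up for illustration, and
it assumes decimal units (1 GB = 1000 MB), which is how drive vendors count.

```python
def storage_cost_cents(file_mb, drive_gb, drive_price_usd):
    """Cost in cents of storing file_mb on the drive (assumes 1 GB = 1000 MB)."""
    cents_per_mb = drive_price_usd * 100 / (drive_gb * 1000)
    return file_mb * cents_per_mb


# 58MB index on an 80GB drive priced at $179
print(round(storage_cost_cents(58, 80, 179)))
```

With the numbers above this rounds to 13 cents.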

First, 2.0.5:

Writing offsets (2)...
24735 files indexed.
Running time: 8 minutes, 39 seconds.

S USER       PID  PPID %CPU %MEM  RSS   VSZ COMMAND
R moseley   2019   421 94.7 72.3 371576 374272 ./swish-e -c doc -i /home/moseley/usrdoc -v 1

(I have 1/2 GB in this machine.)

And index size for 2.0.5:

   76,573,316 Apr 12 16:01 index.swish-e


Now 2.1-dev:

281532 unique words indexed.
4 properties sorted.                                              
24735 files indexed.  184345060 total bytes.  20254324 total words.
Elapsed time: 00:00:58 CPU time: 00:00:58

And max memory usage:

S USER       PID  PPID %CPU %MEM  RSS   VSZ COMMAND
R moseley   2004   421 98.0 16.5 85040 88612 ./swish-e -c doc -i /home/moseley/usrdoc

   55,736,889 Apr 12 15:50 index.swish-e
    2,519,679 Apr 12 15:50 index.swish-e.prop

So in this case, 2.1-dev used less RAM, less disk, and less time.  What a
deal.
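
To put rough numbers on "less", here is a small Python sketch that divides
the 2.0.5 figures by the 2.1-dev ones.  The values are copied from the ps
and ls output in this message; the dictionary layout and the ratios()
helper are just for illustration, and counting the .prop file as part of
the 2.1-dev index size is an assumption (it matches the 58MB in the table).

```python
# Figures taken from the ps/ls output in this message.
runs = {
    "2.0.5": {
        "rss_kb": 371576,
        "index_bytes": 76573316,
        "seconds": 8 * 60 + 39,
    },
    "2.1-dev": {
        "rss_kb": 85040,
        # index.swish-e plus index.swish-e.prop (assumed to count together)
        "index_bytes": 55736889 + 2519679,
        "seconds": 58,
    },
}


def ratios(old, new):
    """Return old/new for every metric present in old."""
    return {key: old[key] / new[key] for key in old}


for metric, value in ratios(runs["2.0.5"], runs["2.1-dev"]).items():
    print(f"{metric}: 2.0.5 is {value:.1f}x the 2.1-dev figure")
```

That works out to roughly 4.4x the RAM, 1.3x the disk, and 8.9x the time
for 2.0.5 on this particular set of files.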

The config file for both was:

~/swish-e/src$ cat doc
IndexOnly .htm .html



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Apr 12 23:33:12 2002