At 01:20 PM 04/12/02 -0700, Michael wrote:
>
>I certainly hope that the increase in index size is not a feature of
>2.1. I currently have some index files under 2.05 that are in excess
>of 50 megabytes. Having that get 15x bigger would not be very good.
There's more use of hash tables, so that bloats the initial size of the
index. I don't remember if 2.0.5 has a separate hash table for wildcard
searches like swish does now. Swish does track a bit more data on each
word indexed, but if anything I'd expect there to be better compression in
2.1.
But, hey, why discuss this? Might as well test.
Here are two indexing runs I just did.
Both runs indexed 24,735 HTML files, 184,345,060 total bytes.
Version    Max RAM   Index size   Indexing time
-------    -------   ----------   -------------
2.0.5      370MB     77MB         8m39s
2.1-dev    85MB      58MB         58s
OK, let's grab a magazine... A Barracuda ATA IV 9ms 80GB drive goes for
$179 USD. That works out to about 13 cents to store my 58MB index.
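The 13-cent figure is easy to check. A minimal sketch in Python, assuming
the advertised 80GB means 80,000 decimal megabytes and rounding the index
to 58MB; the variable names are only for illustration.

```python
# Rough storage cost for the 2.1-dev index on the advertised drive.
# Assumption: 80GB is counted as 80,000 decimal megabytes.
drive_price_usd = 179.0
drive_size_mb = 80 * 1000
index_size_mb = 58

# Price per megabyte, then the cost of the index in cents.
cost_per_mb = drive_price_usd / drive_size_mb
index_cost_cents = cost_per_mb * index_size_mb * 100

print(f"{index_cost_cents:.1f} cents")
```

Counting the drive in binary megabytes instead would lower the figure
slightly, but it still rounds to about 13 cents.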
First, 2.0.5:
Writing offsets (2)...
24735 files indexed.
Running time: 8 minutes, 39 seconds.
S USER      PID  PPID %CPU %MEM    RSS    VSZ COMMAND
R moseley  2019   421 94.7 72.3 371576 374272 ./swish-e -c doc -i /home/moseley/usrdoc -v 1
(I have 1/2 GB in this machine.)
And index size for 2.0.5:
76,573,316 Apr 12 16:01 index.swish-e
Now, 2.1-dev:
281532 unique words indexed.
4 properties sorted.
24735 files indexed. 184345060 total bytes. 20254324 total words.
Elapsed time: 00:00:58 CPU time: 00:00:58
And max memory usage:
S USER      PID  PPID %CPU %MEM    RSS    VSZ COMMAND
R moseley  2004   421 98.0 16.5  85040  88612 ./swish-e -c doc -i /home/moseley/usrdoc
And index size for 2.1-dev:
55,736,889 Apr 12 15:50 index.swish-e
 2,519,679 Apr 12 15:50 index.swish-e.prop
So in this case, 2.1-dev used less RAM, less disk, and less time. What a
deal.
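To put numbers on "less RAM, less disk, and less time", here is a small
Python sketch that computes the ratios from the figures above. It assumes
the .prop file listed for 2.1-dev should be counted as part of its index,
and it uses the RSS column from ps as the memory figure; the dictionary
keys are just illustrative names.

```python
# Figures copied from the two runs above.
source_bytes = 184_345_060

old = {"rss_kb": 371_576, "index_bytes": 76_573_316, "seconds": 8 * 60 + 39}
# Assumption: the .prop file counts toward the 2.1-dev index size.
new = {"rss_kb": 85_040, "index_bytes": 55_736_889 + 2_519_679, "seconds": 58}

# How many times larger each 2.0.5 figure is than the 2.1-dev one.
ratios = {key: old[key] / new[key] for key in old}
for key, ratio in ratios.items():
    print(f"{key}: 2.0.5 is {ratio:.2f}x the 2.1-dev figure")

# Index size as a share of the source text that was indexed.
old_share = old["index_bytes"] / source_bytes
new_share = new["index_bytes"] / source_bytes
print(f"index as share of source: {old_share:.1%} vs {new_share:.1%}")
```

By these figures 2.0.5 needed roughly four times the RAM, about a third
more disk, and close to nine times as long.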
The config file for both was:
~/swish-e/src$ cat doc
IndexOnly .htm .html
--
Bill Moseley
mailto:moseley@hank.org
Received on Fri Apr 12 23:33:12 2002