Big indexes

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jun 13 2002 - 20:49:19 GMT
I wrote a little test program that randomly selects words from a dictionary
and builds a "file" to index using -S prog.
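A minimal sketch of that kind of generator follows, written in Python rather than the Perl of prog.pl, and not the actual test program. It assumes the usual -S prog convention of a Path-Name and Content-Length header, a blank line, then the document body. The function names, the doc paths, and the default of 250 words per document are all made up for illustration.

```python
import random
import sys


def make_document(words, n_words, rng):
    """Return the body of one synthetic document: n_words randomly chosen words."""
    return " ".join(rng.choice(words) for _ in range(n_words)) + "\n"


def format_record(path, body):
    """Wrap a body in the headers a -S prog program is assumed to emit."""
    data = body.encode("utf-8")
    # Content-Length must be the byte length of the body, not the character count.
    header = f"Path-Name: {path}\nContent-Length: {len(data)}\n\n"
    return header.encode("ascii") + data


def generate(words, count, n_words=250, seed=0, out=None):
    """Write `count` synthetic documents to `out` (stdout by default)."""
    if out is None:
        out = sys.stdout.buffer
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    for i in range(count):
        body = make_document(words, n_words, rng)
        out.write(format_record(f"doc{i}.txt", body))
```

Something like generate(open("/usr/share/dict/words").read().split(), 1000) would then write 1,000 documents to stdout for swish-e to read, assuming a word list exists at that path.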

It's clear that indexing speed drops off as the number of files indexed
grows.  It may be that our hashes can be tuned better.  But it also looks
like using -e is a very good thing to do.  Well, to a point.

In this first test I let it run to about 100,000 files with an average size
of 1,964 bytes.  This used about 1/4GB of RAM.

PID TTY      STAT   TIME  MAJFL   TRS   DRS  RSS %MEM COMMAND
2028 pts/0    R      5:44    224   243 248828 244616 47.6 ./swish-e -S prog -c


moseley@bumby:~/swish-e/src$ ./swish-e -S prog -c c
Indexing Data Source: "External-Program"
Indexing "./prog.pl"
[./prog.pl] Setting file count = 1000000
[./prog.pl] Send a 'kill -hup 2029' to abort
File 5000  534.16/second over 5000 records.
File 10000  483.20/second over 5000 records.
File 15000  441.84/second over 5000 records.
File 20000  407.12/second over 5000 records.
File 25000  377.15/second over 5000 records.
File 30000  351.27/second over 5000 records.
File 35000  328.86/second over 5000 records.
File 40000  308.51/second over 5000 records.
File 45000  289.61/second over 5000 records.
File 50000  273.41/second over 5000 records.
File 55000  257.65/second over 5000 records.
File 60000  243.72/second over 5000 records.
File 65000  231.59/second over 5000 records.
File 70000  220.66/second over 5000 records.
File 75000  210.78/second over 5000 records.
File 80000  201.33/second over 5000 records.
File 85000  192.95/second over 5000 records.
File 90000  185.03/second over 5000 records.
File 95000  177.96/second over 5000 records.
File 100000  170.70/second over 5000 records.
[./prog.pl] Aborted at record 104714
File 104714  255.56/second over 104714 records.
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45373 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
45373 unique words indexed.
4 properties sorted.                                              
104713 files indexed.  205733891 total bytes.  21268870 total words.
Elapsed time: 00:07:02 CPU time: 00:06:06
Indexing done!
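
To put a number on the drop-off, the progress lines can be turned into seconds per batch. This is a rough sketch that assumes each rate covers only the most recent 5,000 records; the function name and the three sample lines (copied from the run above) are just for illustration.

```python
import re

# Matches lines such as "File 5000  534.16/second over 5000 records."
PROGRESS = re.compile(r"^File (\d+)\s+([\d.]+)/second over (\d+) records\.")


def batch_seconds(log_text):
    """Return (file_count, seconds_for_that_batch) for each progress line."""
    rows = []
    for line in log_text.splitlines():
        m = PROGRESS.match(line.strip())
        if m:
            count = int(m.group(1))
            rate = float(m.group(2))
            batch = int(m.group(3))
            rows.append((count, batch / rate))  # records / (records per second)
    return rows


sample = """\
File 5000  534.16/second over 5000 records.
File 50000  273.41/second over 5000 records.
File 100000  170.70/second over 5000 records.
"""

for count, secs in batch_seconds(sample):
    print(f"{count:>7} files: {secs:5.1f} s for the last batch")
```

By this measure a batch of 5,000 files goes from roughly 9 seconds at the start to roughly 29 seconds at the 100,000-file mark.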


Now, here's the same test using -e.

Memory usage is much better, about 10MB instead of 1/4GB, and overall file
processing speed is better too: 310 files per second with -e vs. 255/second
without -e.

PID TTY      STAT   TIME  MAJFL   TRS   DRS  RSS %MEM COMMAND
2063 pts/0    R      8:31    232   243 12432 10348  2.0 ./swish-e -S prog -c c
 
moseley@bumby:~/swish-e/src$ ./swish-e -S prog -c c -e
Indexing Data Source: "External-Program"
Indexing "./prog.pl"
[./prog.pl] Setting file count = 1000000
[./prog.pl] Send a 'kill -hup 2064' to abort
File 5000  313.82/second over 5000 records.
File 10000  311.03/second over 5000 records.
File 15000  310.38/second over 5000 records.
File 20000  309.69/second over 5000 records.
File 25000  309.97/second over 5000 records.
File 30000  310.03/second over 5000 records.
File 35000  310.05/second over 5000 records.
File 40000  310.00/second over 5000 records.
File 45000  309.54/second over 5000 records.
File 50000  311.07/second over 5000 records.
File 55000  310.09/second over 5000 records.
File 60000  310.48/second over 5000 records.
File 65000  310.81/second over 5000 records.
File 70000  310.61/second over 5000 records.
File 75000  310.90/second over 5000 records.
File 80000  310.14/second over 5000 records.
File 85000  310.55/second over 5000 records.
File 90000  311.14/second over 5000 records.
File 95000  309.71/second over 5000 records.
File 100000  309.83/second over 5000 records.
File 105000  310.51/second over 5000 records.
File 110000  310.95/second over 5000 records.
File 115000  310.43/second over 5000 records.
File 120000  310.54/second over 5000 records.
File 125000  309.89/second over 5000 records.
[./prog.pl] Aborted at record 128714
File 128714  310.39/second over 128714 records.
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45373 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
45373 unique words indexed.
4 properties sorted.                                              
128713 files indexed.  252888063 total bytes.  26143735 total words.
Elapsed time: 00:09:46 CPU time: 00:08:34
Indexing done
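
The two runs can be compared directly from the figures above. This is only a back-of-the-envelope sketch: it assumes the ps RSS column is in kilobytes, and that a single ps snapshot is representative of each run, which it may not be. The names are illustrative.

```python
def compare(without_e, with_e):
    """Return (speedup, memory_ratio) of the -e run relative to the plain run."""
    speedup = with_e["files_per_sec"] / without_e["files_per_sec"]
    memory_ratio = without_e["rss_kb"] / with_e["rss_kb"]
    return speedup, memory_ratio


# Overall rates from the final "File ..." line of each run, RSS from ps.
plain = {"files_per_sec": 255.56, "rss_kb": 244616}
economy = {"files_per_sec": 310.39, "rss_kb": 10348}

speedup, memory_ratio = compare(plain, economy)
print(f"-e is {speedup:.2f}x the overall rate, using 1/{memory_ratio:.0f} of the RSS")
```

That works out to a bit over 1.2 times the overall rate with roughly one twenty-fourth of the resident memory, though the runs stopped at different file counts, so the comparison is only approximate.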

I tried indexing 1,000,000 files but hit my 2GB file size limit:

File 880000  309.66/second over 5000 records.
File size limit exceeded

That took about an hour and a quarter to get to that point.



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu Jun 13 20:53:31 2002