
Re: Indexing performances, multi millions words

From: Jean-François PIÉRONNE <jfp(at)not-real.altavista.net>
Date: Fri Dec 28 2001 - 15:38:18 GMT
> 
> There's a new config setting that's in CVS -- I'm not sure how long ago I
> checked it in.  I'm not even sure if it will stay in swish.
> 
>    IgnoreNumberChars
> 
> For example, if you set it like:
> 
>    IgnoreNumberChars 0123456789abcdef
> 
> It won't index hex numbers.  Of course it won't index the word "bad"
> either.  So you can see it has limited use for hex.  It's really more
> designed for
> 
>    IgnoreNumberChars 0123456789,.
> 
> But, I'm still not clear if it's worth keeping in the source.
> 

Indexing with
    IgnoreNumberChars 0123456789abcdef

removes 3 million words!

> Keep in mind that indexing is MUCH faster and more memory efficient than in
> 1.3 (or 2.0.x).  I mention this often, but my /usr/doc of 25,000 files
> indexes in about 4 minutes (on my 128M system).  Not too long ago that was
> 15 minutes.  And before that it was about three hours (with swapping).  An
> index that took nine hours on sunsite now takes about 15 minutes, if I
> remember correctly.
> 
> Swish doesn't get pushed into the millions of files, or millions of words
> too often, so these speed issues don't show up much.  But, it's a good time
> to try to optimize even more.  So, if you can find anything that improves
> the indexing, then that's great.
> 

I have run some tests indexing my big directory and a much smaller one, using
different values for the three HASHSIZE parameters:

==========================
Large number of words
--------------------------

1707117 unique words indexed.
8 properties sorted.
11151 files indexed.  1234505869 total bytes.  85728819 total words.

--------------------------------------------------------------
USE_BTREE not defined

HASHSIZE 10007
BIGHASHSIZE 100003
SEARCHHASHSIZE 1000003
Elapsed time: 00:18:17 CPU time: 00:16:41


HASHSIZE 1009
BIGHASHSIZE 10001
SEARCHHASHSIZE 100003
Elapsed time: 00:22:25 CPU time: 00:20:42

HASHSIZE 101
BIGHASHSIZE 1009
SEARCHHASHSIZE 10001
Elapsed time: 01:23:14 CPU time: 01:20:47

--------------------------------------------------------------
USE_BTREE defined


HASHSIZE 1009
BIGHASHSIZE 10001
SEARCHHASHSIZE 100003
Elapsed time: 00:25:27 CPU time: 00:20:48

HASHSIZE 101
BIGHASHSIZE 1009
SEARCHHASHSIZE 10001
Elapsed time: 01:13:54 CPU time: 01:08:58


==========================
Small number of words
--------------------------

203651 unique words indexed.
8 properties sorted.
8294 files indexed.  178915672 total bytes.  13329601 total words.

HASHSIZE 101
BIGHASHSIZE 1009
SEARCHHASHSIZE 10001
Elapsed time: 00:04:53 CPU time: 00:03:49

HASHSIZE 1009
BIGHASHSIZE 10001
SEARCHHASHSIZE 100003
Elapsed time: 00:03:27 CPU time: 00:02:48



As you can see, increasing the *HASHSIZE parameters is a big win.

So, IMHO, it would be better to default the three HASHSIZE parameters to the
following settings:
HASHSIZE 1009
BIGHASHSIZE 10001
SEARCHHASHSIZE 100003


I haven't run any tests varying only one of the three parameters.



> BTW -- Jose has tested using a btree-type of indexing scheme.  I'm not sure
> of it's current state, but you might (for fun) try indexing with:
> 
> #define USE_BTREE
> 
> in config.h.  I don't think you can use -e with btree.
> 

It works, with nearly identical (very good) performance.


Jean-François
Received on Fri Dec 28 15:38:47 2001