>
> There's a new config setting that's in CVS -- I'm not sure how long ago I
> checked it in. I'm not even sure if it will stay in swish.
>
> IgnoreNumberChars
>
> For example, if you set it like:
>
> IgnoreNumberChars 0123456789abcdef
>
> It won't index hex numbers. Of course it won't index the word "bad"
> either. So you can see it has limited use for hex. It's really more
> designed for
>
> IgnoreNumberChars 0123456789,.
>
> But, I'm still not clear if it's worth keeping in the source.
>
Indexing with
IgnoreNumberChars 0123456789abcdef
removes 3 million words!
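A minimal sketch of the rule as described above, assuming a word is skipped when every one of its characters is in the configured set. This is only an illustration in Python, not swish's actual code; the function name is_ignored and the sample words are made up for the example.

```python
def is_ignored(word: str, ignore_chars: str) -> bool:
    """Return True if the word consists only of characters in ignore_chars."""
    chars = set(ignore_chars)
    # An empty word is never treated as a number.
    return bool(word) and all(c in chars for c in word)


hex_chars = "0123456789abcdef"
words = ["deadbeef", "bad", "2001", "swish", "15,000.5"]

# Words that would still be indexed with the hex setting (hypothetical sample).
kept = [w for w in words if not is_ignored(w, hex_chars)]
print(kept)
```

With the hex setting both "deadbeef" and "bad" are dropped, which is the limitation mentioned in the quoted text; with 0123456789,. only the purely numeric words are dropped.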
> Keep in mind that indexing is MUCH faster and more memory efficient than in
> 1.3 (or 2.0.x). I mention this often, but my /usr/doc of 25,000 files
> indexes in about 4 minutes (on my 128M system). Not too long ago that was
> 15 minutes. And before that it was about three hours (with swapping). An
> index that took nine hours on sunsite now takes about 15 minutes, if I
> remember correctly.
>
> Swish doesn't get pushed into the millions of files, or millions of words
> too often, so these speed issues don't show up much. But, it's a good time
> to try to optimize even more. So, if you can find anything that improves
> the indexing, then that's great.
>
I have done some tests indexing my big directory and a much smaller one, using
different values for the three HASHSIZE parameters:
==========================
Large number of words
--------------------------
1707117 unique words indexed.
8 properties sorted.
11151 files indexed. 1234505869 total bytes. 85728819 total words.
--------------------------------------------------------------
USE_BTREE not defined
HASHSIZE  BIGHASHSIZE  SEARCHHASHSIZE  Elapsed time  CPU time
   10007       100003         1000003      00:18:17  00:16:41
    1009        10001          100003      00:22:25  00:20:42
     101         1009           10001      01:23:14  01:20:47
--------------------------------------------------------------
USE_BTREE defined
HASHSIZE  BIGHASHSIZE  SEARCHHASHSIZE  Elapsed time  CPU time
    1009        10001          100003      00:25:27  00:20:48
     101         1009           10001      01:13:54  01:08:58
==========================
Small number of words
--------------------------
203651 unique words indexed.
8 properties sorted.
8294 files indexed. 178915672 total bytes. 13329601 total words.
HASHSIZE  BIGHASHSIZE  SEARCHHASHSIZE  Elapsed time  CPU time
     101         1009           10001      00:04:53  00:03:49
    1009        10001          100003      00:03:27  00:02:48
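To put the timings in perspective, the elapsed times for the large collection (USE_BTREE not defined) can be converted to seconds and compared against the smallest setting. A small Python sketch; the helper name to_seconds is made up for the example, and the figures are copied from the results above.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# Elapsed times for the large collection without USE_BTREE, keyed by HASHSIZE.
elapsed = {10007: "00:18:17", 1009: "00:22:25", 101: "01:23:14"}

baseline = to_seconds(elapsed[101])
for hashsize, hms in elapsed.items():
    ratio = baseline / to_seconds(hms)
    print(f"HASHSIZE {hashsize}: {to_seconds(hms)} s, {ratio:.1f}x relative to HASHSIZE 101")
```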
As you can see, increasing the *HASHSIZE parameters is a big win: without
USE_BTREE, the large index drops from 01:23:14 to 00:22:25 elapsed.
So, IMHO, it would be better to default the three HASHSIZE parameters to the
following settings:
HASHSIZE 1009
BIGHASHSIZE 10001
SEARCHHASHSIZE 100003
I haven't done any tests varying only one of the three parameters.
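One plausible reason the table size matters so much: in a chained hash table, the average chain a lookup has to walk is roughly entries divided by buckets. The Python sketch below assumes, purely for illustration, that all unique words of the large test sit in one chained table with a uniform hash; which swish table actually holds them is not something established here, and average_chain_length is a made-up helper.

```python
def average_chain_length(entries: int, buckets: int) -> float:
    """Average number of entries per bucket, assuming a uniform hash."""
    return entries / buckets


unique_words = 1707117  # unique words reported for the large test

# Candidate table sizes taken from the settings tried above.
for buckets in (1009, 10001, 100003, 1000003):
    length = average_chain_length(unique_words, buckets)
    print(f"{buckets} buckets: about {length:.1f} words per bucket")
```

Under that assumption each tenfold increase in table size cuts the average chain by about ten, which fits the direction of the timings, though it does not say which of the three parameters is responsible.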
> BTW -- Jose has tested using a btree-type of indexing scheme. I'm not sure
> of its current state, but you might (for fun) try indexing with:
>
> #define USE_BTREE
>
> in config.h. I don't think you can use -e with btree.
>
It works, with nearly identical (very good) performance.
Jean-François
Received on Fri Dec 28 15:38:47 2001