Re: Indexing performances, multi millions words

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 26 2001 - 19:21:21 GMT
At 02:53 AM 12/26/01 -0800, Jean-François PIÉRONNE wrote:
>indexing large documents (more than 11000 files, 1.2 Go and near 4.5 M words), i
>have noticed that the indexing times can be heavily reduced when i increased the
>three "#define" HASHSIZE, BIGHASHSIZE, SEARCHHASHSIZE.
>
>I don't know which of the three is the most significant, but indexing time drop
>from 6 hours to less than 2 hours, and these 2 hours are mostly CPU bound.

Jose will need to comment on those settings.  I've played with them a
little but didn't see much change.  What specific settings did you use?

But if you really have 4.5 million words, then maybe increasing the hash
size would help.  A larger hash table means fewer words sharing each hash
value, so less stepping through them one by one on every lookup.  There may
be some other reasons to use higher values - going from six hours to two
hours makes me think you went from swapping to not swapping.  Was the
machine load (and memory demand) the same for each run?  That is, were there
other programs demanding memory on the six-hour run?
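
To illustrate the hash-size point, here is a minimal sketch of a chained
hash table that counts how many words a lookup has to step through on
average.  It is not swish-e's code: the hash function (CRC32), the synthetic
words, and the bucket counts are all assumptions picked for illustration,
not the actual HASHSIZE values.

```python
import zlib


def average_probes(num_words, num_buckets):
    """Average number of words examined to find a word that is present
    in a chained hash table with num_buckets chains."""
    chain_lengths = [0] * num_buckets
    for i in range(num_words):
        word = f"word{i}".encode()  # synthetic unique "words" (assumption)
        chain_lengths[zlib.crc32(word) % num_buckets] += 1
    # Finding a word in a chain of length n examines (n + 1) / 2 words on
    # average, so the chain's n words together cost n * (n + 1) / 2.
    total = sum(n * (n + 1) / 2 for n in chain_lengths)
    return total / num_words


# Hypothetical bucket counts, not swish-e defaults.
for buckets in (1009, 10007, 100003):
    print(buckets, round(average_probes(200_000, buckets), 1))
```

With the number of unique words fixed, the average chain walk shrinks
roughly in proportion to the number of buckets, which is why a bigger hash
size can matter once the word count runs into the millions.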

But, 4.5 million unique "words"?  That's a lot of words.  Are you really
going to search those words?

I did this the other day:

679890 unique words indexed.
2 properties sorted.                                              
38740 files indexed.  455105121 total bytes.  19705343 total words.
Elapsed time: 00:11:12 CPU time: 00:09:40
Indexing done!

That was on a BSD machine with a load average of about *ten*.

680K unique words is a lot, I thought, and I discovered that was due to
indexing a mail archive with MIME attachments.  Without the mail archive it's:

75495 unique words indexed.
2 properties sorted.                                              
29505 files indexed.  384191825 total bytes.  12758054 total words.
Elapsed time: 00:07:09 CPU time: 00:05:45
Indexing done!
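
As a rough cross-check, the two runs above can be turned into throughput
figures.  This is only arithmetic on the numbers printed above; the
dictionary layout and function names are made up for the example.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the two indexing summaries above.
runs = {
    "with mail archive": {
        "total_words": 19705343, "total_bytes": 455105121, "elapsed": "00:11:12",
    },
    "without mail archive": {
        "total_words": 12758054, "total_bytes": 384191825, "elapsed": "00:07:09",
    },
}


def words_per_second(run):
    """Total words indexed per second of elapsed (wall-clock) time."""
    return run["total_words"] / to_seconds(run["elapsed"])


for name, run in runs.items():
    print(f"{name}: {words_per_second(run):,.0f} words/s")
```

Both runs come out at somewhat over 29,000 total words per second of
elapsed time, even though the unique word count differs by about a factor
of nine.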




Bill Moseley
mailto:moseley@hank.org
Received on Wed Dec 26 19:21:28 2001