Re: More on indexing and memory requirements in swish-e

From: Bas Meijer <bas(at)>
Date: Thu Aug 31 2000 - 14:52:28 GMT
Hi,

A 40% memory gain in a few days is pretty cool!! Who knows what Jose 
can do after a nice holiday? We want swish-e to be a good citizen 
alongside other processes. Sometimes swish-e runs in a 'shared' 
environment where system managers get pretty freaky when one 
process consumes most of the cycles and most of the memory; sometimes it 
just won't run because there aren't enough resources, 
like when indexing four years of discussions amongst 4000 network 
managers, some 176,000 HTML files at last count. Sun/SGI RAM is more 
expensive than others' :-{

It would be great to have a fast & scalable indexer. Maybe there 
could be a merge function that doesn't do everything in RAM? Temp files 
could be used?
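
To make the temp-file idea concrete, here is a minimal sketch in Python. It is not swish-e code: the (word, file number) pairs, the tab-separated run format, and the function names are all assumptions made up for illustration. Each batch is sorted in RAM and spilled to a temp file, and the merge then streams through the files while holding only one line per file in memory.

```python
import heapq
import os
import tempfile


def write_sorted_run(pairs, directory):
    """Sort one in-memory batch of (word, file_number) pairs and spill it to a temp file."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for word, file_num in sorted(pairs):
            # Hypothetical run format: one "word<TAB>file_number" per line.
            handle.write(f"{word}\t{file_num}\n")
    return path


def read_run(path):
    """Lazily yield (word, file_number) pairs from one sorted run file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, file_num = line.rstrip("\n").split("\t")
            yield word, int(file_num)


def merge_runs(paths):
    """Stream-merge sorted runs, yielding (word, [file numbers]) one word at a time."""
    current_word, postings = None, []
    # heapq.merge keeps only the next pair of each run in memory.
    for word, file_num in heapq.merge(*(read_run(p) for p in paths)):
        if word != current_word:
            if current_word is not None:
                yield current_word, postings
            current_word, postings = word, []
        postings.append(file_num)
    if current_word is not None:
        yield current_word, postings
```

With this shape, peak memory depends on the batch size and the number of run files, not on the total number of words indexed. The price is extra disk I/O for writing and re-reading the runs.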


At 06:39 -0700 31-08-2000, wrote:
>Hi all,
>The old news...
>As you have read in previous posts to this list, swish-e 2.x
>consumes a really big amount of memory during the indexing
>process. Much of this memory is used for storing the
>per-word info:
>- file number (index to the file info)
>- metaname (it is 1 if no metaname, 2,3 for the rest)
>- structure (stores if the word is in head, body, title ...)
>- frequency (the number of occurrences of the word in the file)
>- positions (the positions of the word in the file). This value can
>repeat.
>Each of these values needs 4 bytes.
>Now, the new and good news...
>Much of that info can be compressed to save memory. So I
>decided to give it a try and modify the code to handle it. Here are
>the results:
>The test case contains 10000 files and 35000 different words.
>Each file contains about 70 words with 7 fields (metaNames) and 5
>The test box is a SUN Solaris 2.6 (400 MHz) with 512MB.
>(Note: All the files are in memory cache to minimize the effect of
>the filesystem I/O).
>swish-e-2.0.1 needed 33 MB of RAM and the index time was 33
>"Modified" swish-e 2.x (including new index engine and beta
>compression option) needed 20 MB RAM and the index time was
>35 seconds.
>Both output index files are identical (except for the date/time in
>the header info).
>As you see, there is a reduction in memory usage of about 40%.
>I do not know if this is enough. Of course, it depends on how many
>docs are being indexed and how powerful your machine is.
>I will release these modifications after completing them (Need to add
>them to merge option).
>Now, it is time for my vacation.
>cu on Sept 17
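
The quoted message does not say which compression scheme was used, so the following is only a sketch of one common way to shrink such per-word records, not Jose's actual code. It assumes a variable-length integer encoding (7 bits per byte) plus delta-encoded positions, and compares that with the fixed 4 bytes per value described above. All function names and the sample values are hypothetical.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, lowest group first."""
    if n < 0:
        raise ValueError("varint encoding needs a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Decode a byte string holding consecutive varints into a list of integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def pack_fixed(file_num, metaname, structure, positions):
    """Baseline from the message: every value takes 4 bytes."""
    fields = [file_num, metaname, structure, len(positions), *positions]
    return struct.pack(f"<{len(fields)}I", *fields)


def pack_compressed(file_num, metaname, structure, positions):
    """Assumed scheme: varints, with ascending positions stored as gaps."""
    fields = [file_num, metaname, structure, len(positions)]
    previous = 0
    for position in positions:
        fields.append(position - previous)  # small gaps fit in one byte
        previous = position
    return b"".join(encode_varint(f) for f in fields)


def unpack_compressed(data):
    """Reverse pack_compressed, rebuilding absolute positions from the gaps."""
    values = decode_varints(data)
    file_num, metaname, structure, frequency = values[:4]
    positions, running = [], 0
    for gap in values[4:4 + frequency]:
        running += gap
        positions.append(running)
    return file_num, metaname, structure, positions
```

For a word found at positions 3, 17 and 42 of file 9000, pack_fixed produces 28 bytes while pack_compressed produces 8, and unpack_compressed restores the original values. Real savings depend on the data, and the extra encoding work costs some CPU time, which would be in line with the slightly longer index time reported above.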


--  /'''     Bas Meijer
     c-OO Web Services
     \  >     Kerkstraat 19 Postbus 256 1400 AG Bussum
      \&&     t. +31 35 7502100  f. +31 35 7502111
Received on Thu Aug 31 14:56:42 2000