
Re: Indexing performance, multi-million words

From: Jean-François PIÉRONNE <jfp(at)not-real.altavista.net>
Date: Thu Dec 27 2001 - 01:47:02 GMT
> 
> At 04:05 PM 12/26/01 -0800, Jean-François PIÉRONNE wrote:
> >> But, 4.5 million unique "words"?  That's a lot of words.  Are you really
> >> going to search those words?
> >>
> >
> >The files are source listings (OpenVMS source listings) which contain a
> >lot of numbers in decimal, hexadecimal and C hex format (0x format), but
> >I haven't found how to avoid indexing the two hex formats (for the first
> >format I have defined IGNOREALLN to 1 in config.h).
> 
> There's a new config setting that's in CVS -- I'm not sure how long ago I
> checked it in.  I'm not even sure if it will stay in swish.
> 
>    IgnoreNumberChars
> 
> For example, if you set it like:
> 
>    IgnoreNumberChars 0123456789abcdef
> 
> It won't index hex numbers.  Of course it won't index the word "bad"
> either.  So you can see it has limited use for hex.  It's really more
> designed for
> 
>    IgnoreNumberChars 0123456789,.
> 
> But, I'm still not clear if it's worth keeping in the source.
>
> Another approach would be to use -S prog and simply use regex matching to
> remove all the hex numbers from the source before sending to swish.
>
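
Something along these lines might work as a -S prog filter (an untested
Python sketch; the file-list-on-stdin convention, the script name, and the
regex are my own assumptions, not existing swish-e code):

   #!/usr/bin/env python
   # Untested sketch: strip hex numbers before swish-e sees the files.
   # Assumes the paths of the files to index arrive one per line on stdin.
   import os
   import re
   import sys

   # 0x-style hex, plus bare hex tokens containing at least one digit,
   # so ordinary words like "bad" are kept.
   HEX_RE = re.compile(r'\b(0[xX][0-9a-fA-F]+|[0-9a-fA-F]*[0-9][0-9a-fA-F]*)\b')

   for line in sys.stdin:
       path = line.strip()
       if not path:
           continue
       with open(path, errors='replace') as f:
           content = HEX_RE.sub(' ', f.read())
       # swish-e's prog interface: headers, a blank line, then the body.
       # (Content-Length should really be a byte count; close enough for
       # ASCII source listings.)
       sys.stdout.write("Path-Name: %s\n" % path)
       sys.stdout.write("Content-Length: %d\n" % len(content))
       sys.stdout.write("Last-Mtime: %d\n\n" % os.path.getmtime(path))
       sys.stdout.write(content)

It could then be fed to swish-e with something like (if I read the -S prog
docs right):

   find /src -name "*.lis" | python striphex.py | swish-e -S prog -i stdin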


What about a new config setting like

   IgnoreWords regex

where regex is a regular expression that matches the words to be ignored?
So if a word matches the regex, it is discarded.
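
For example (hypothetical syntax; no such directive exists today):

   IgnoreWords ^(0x[0-9a-fA-F]+|[0-9a-fA-F]*[0-9][0-9a-fA-F]*)$

would discard 0x-format hex and bare hex numbers containing at least one
digit, while still indexing ordinary words such as "bad".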


 
> Regarding -e and not -e:  If your OS is caching the temporary file in RAM
> anyway, might as well run without -e.  You can imagine that -e with your
> index size would really work the disk drive.  During the "Writing Word
> Data" step, -e has to seek all over the place to collect the word position
> data from different documents together.  Without -e, the words are just
> linked together in memory while indexing.
> 
> I'm also not clear if there's any optimization (or guessing) done to
> prevent reallocating memory too often when using a very large number of
> unique words.
> 
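
For reference, the two runs below differ only in swish-e's -e "economy mode"
switch, assuming the usual -c config-file invocation:

   swish-e -c swish.conf -e     (with economy mode)
   swish-e -c swish.conf        (without)
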
> >With the '-e' switch:
> >The process, until it reached the "Writing word data:" point, took 30
> >minutes of CPU and used 350 MB of memory, with no paging or swapping.
> 
> >Without the '-e' switch:
> >The process, until it reached the "Writing word data:" point, took 38
> >minutes of CPU and used 350 MB of memory, and generated more than 10
> >million page faults.
> 
> Page faults meaning that it accessed memory that had been swapped out to disk?
>

OpenVMS has the concept of a working set, which is the maximum physical
memory a process can map at any time.
There are also two caches (the free list and the modified list).
So when a page is removed from the working set of a process, it first goes
into one of those two lists; then, if there is not enough memory, it is
eventually written to a pagefile.
During this run, pages just went to the cache and were found in that cache
the next time they were referenced (this is called a soft fault, which
means there is no I/O).
 
> Does that mean you really don't have enough RAM for indexing?  Or do you
> have some memory limits that force swapping to disk?
> 

Right, my current system parameters limit the physical memory that can be
used by a process to 350 MB.

I have extended this limit to 700 MB; the page faults then disappeared, and
the process used 680 MB and took 39 minutes of CPU.
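
For reference, a sketch of the DCL involved; the exact values and quota
names depend on the system setup, and the per-process working set extent is
also capped by the WSMAX system parameter:

   $ SHOW WORKING_SET
   $ SET WORKING_SET /EXTENT=1433600   ! ~700 MB in 512-byte pagelets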

> >So the '-e' switch seems to make the "Writing word data:" step very costly.
> 
> I think -e is very costly.  While indexing, the word position data are
> written sequentially out to disk, so there's a lot of disk I/O reading them
> back in.  I think the point of -e is for people who don't have enough
> memory, where indexing would basically make the machine swap to death.
> 
> Keep in mind that indexing is MUCH faster and more memory efficient than in
> 1.3 (or 2.0.x).  I mention this often, but my /usr/doc of 25,000 files
> indexes in about 4 minutes (on my 128M system).  Not too long ago that was
> 15 minutes.  And before that it was about three hours (with swapping).  An
> index that took nine hours on sunsite now takes about 15 minutes, if I
> remember correctly.
> 

Agreed, I have other documentation that indexes very fast; remember my
previous comparison between SWISH-E and HTDIG.

> Swish doesn't get pushed into the millions of files, or millions of words
> too often, so these speed issues don't show up much.  But, it's a good time
> to try to optimize even more.  So, if you can find anything that improves
> the indexing, then that's great.
> 
> BTW -- Jose has tested using a btree-type indexing scheme.  I'm not sure of
> its current state, but you might (for fun) try indexing with:
> 
> #define USE_BTREE
> 
> in config.h.  I don't think you can use -e with btree.
> 

OK, I will try it, for fun :-)


Thanks,

Jean-François
Received on Thu Dec 27 01:48:30 2001