>
> At 04:05 PM 12/26/01 -0800, Jean-François PIÉRONNE wrote:
> >> But, 4.5 million unique "words"? That's a lot of words. Are you really
> >> going to search those words?
> >>
> >
> >The files are source listings (OpenVMS source listings) which contain a lot of
> >numbers in decimal, hexadecimal and C hex (0x) format, but I haven't found
> >how to avoid indexing the two hex formats (for the first format I have defined
> >IGNOREALLN to 1 in config.h)
>
> There's a new config setting that's in CVS -- I'm not sure how long ago I
> checked it in. I'm not even sure if it will stay in swish.
>
> IgnoreNumberChars
>
> For example, if you set it like:
>
> IgnoreNumberChars 0123456789abcdef
>
> It won't index hex numbers. Of course it won't index the word "bad"
> either. So you can see it has limited use for hex. It's really more
> designed for
>
> IgnoreNumberChars 0123456789,.
>
> But, I'm still not clear if it's worth keeping in the source.
>
> Another approach would be to use -S prog and simply use regex matching to
> remove all the hex numbers from the source before sending to swish.
>
What about a new config setting like
IgnoreWords regex
where regex is a regular expression that matches the words to be
ignored?
So if a word matches regex, it is discarded.
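
As a rough sketch of the difference between the two approaches (Python, not swish-e code): the character-set check below is only my reading of how IgnoreNumberChars behaves, the regex variant is the proposal above and does not exist in swish, and the pattern and function names are made up for illustration.

```python
import re

# Hypothetical pattern: C-style hex (0x...) or a run of hex digits
# that contains at least one decimal digit, so "bad" is not matched.
HEX_WORD = re.compile(r"^(?:0x[0-9a-f]+|[0-9a-f]*[0-9][0-9a-f]*)$", re.IGNORECASE)


def ignored_by_charset(word, chars="0123456789abcdef"):
    """Assumed IgnoreNumberChars behaviour: drop words made only of `chars`."""
    return bool(word) and all(c in chars for c in word.lower())


def ignored_by_regex(word, pattern=HEX_WORD):
    """Proposed behaviour: drop words that match the regular expression."""
    return pattern.match(word) is not None


def filter_words(words):
    """Keep only the words the regex rule would still index."""
    return [w for w in words if not ignored_by_regex(w)]


if __name__ == "__main__":
    for w in ["bad", "0x1F", "7FFE0000", "SYS$QIOW"]:
        print(w, ignored_by_charset(w), ignored_by_regex(w))
```

The same filter could also run in a -S prog script before the text reaches swish. Note that a pure-letter hex value such as "deadbeef" still gets through this pattern, since nothing distinguishes it from a word.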
> Regarding -e and not -e: If your OS is caching the temporary file in RAM
> anyway, might as well run without -e. You can imagine that -e with your
> index size would really work the disk drive. While in the step "Writing
> Word Data" -e has to seek all over the place to collect all the words (word
> position data) from different documents together. Without -e, the words
> are just linked together in memory while indexing.
>
> I'm also not clear if there's any optimization (or guesses) done to prevent
> reallocating memory too often when using a very large number of unique
> words.
>
> >With '-e' switch:
> >The process, until it reached the "Writing word data:" point, took 30 min CPU
> >and used 350 MB of memory, with no paging or swapping.
>
> >Without '-e' switch:
> >The process, until it reached the "Writing word data:" point, took 38 min CPU
> >and used 350 MB of memory, generating more than 10 M page faults.
>
> Page faults meaning that it accessed memory that had been swapped out to disk?
>
OpenVMS has a concept of working set, which is the maximum physical memory a
process can map at any time.
There are also two caches (the free list and the modified list).
So if a page is removed from the working set of a process, it first goes into one
of the two lists and then, if there is not enough memory, is eventually written
to a pagefile.
During the run, pages just went to the cache and were found in that cache the
next time they were referenced (this is called a soft fault, which
means there is no I/O).
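
To illustrate, here is a toy model in Python (not how OpenVMS is implemented): a FIFO working set with a single cache standing in for the free and modified lists. The function name, the sizes and the FIFO policy are all assumptions made for the example, and a first touch of a page is simply counted as a hard fault.

```python
from collections import OrderedDict


def simulate(refs, working_set_size, cache_size):
    """Return (soft_faults, hard_faults) for a sequence of page references."""
    working_set = OrderedDict()  # pages currently mapped, oldest first
    cache = OrderedDict()        # stands in for the free and modified lists
    soft = hard = 0
    for page in refs:
        if page in working_set:
            continue             # page is mapped: no fault
        if page in cache:
            soft += 1            # found in the cache: no I/O needed
            del cache[page]
        else:
            hard += 1            # not cached: would need I/O
        if len(working_set) >= working_set_size:
            evicted, _ = working_set.popitem(last=False)
            cache[evicted] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # pushed out of the cache
        working_set[page] = True
    return soft, hard


if __name__ == "__main__":
    refs = [0, 1, 2, 3] * 3
    print(simulate(refs, working_set_size=2, cache_size=4))
    print(simulate(refs, working_set_size=4, cache_size=4))
```

With a working set smaller than the set of pages in use, every reference after the first pass is a soft fault; once the working set is large enough to hold all the pages, the faults after the first touch disappear.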
> Does that mean you really don't have enough RAM for indexing? Or do you
> have some memory limits that force swapping to disk?
>
Right, my current system parameters limit the physical memory that can be used
by a process to 350 MB.
I have extended this limit to 700 MB; the page faults then disappear, and the
process used 680 MB and took 39 min CPU.
> >So the '-e' switch seems to make the "Writing word data:" step very costly.
>
> I think -e is very costly. While indexing the word position data are
> written sequentially out to disk, and so it's a lot of disk i/o reading
> back in. I think the point of -e is for people that don't have enough
> memory, where indexing would basically make the machine swap to death.
>
> Keep in mind that indexing is MUCH faster and more memory efficient than in
> 1.3 (or 2.0.x). I mention this often, but my /usr/doc of 25,000 files
> indexes in about 4 minutes (on my 128M system). Not too long ago that was
> 15 minutes. And before that it was about three hours (with swapping). An
> index that took nine hours on sunsite now takes about 15 minutes, if I
> remember correctly.
>
Agreed, I have other documentation that is indexed very fast; remember my
previous comparison between SWISH-E and HTDIG.
> Swish doesn't get pushed into the millions of files, or millions of words
> too often, so these speed issues don't show up much. But, it's a good time
> to try to optimize even more. So, if you can find anything that improves
> the indexing, then that's great.
>
> BTW -- Jose has tested using a btree-type of indexing scheme. I'm not sure
> of its current state, but you might (for fun) try indexing with:
>
> #define USE_BTREE
>
> in config.h. I don't think you can use -e with btree.
>
OK, I will try it, for fun :-)
Thanks,
Jean-François
Received on Thu Dec 27 01:48:30 2001