Paul,
I don't understand the internals, but your version sounds interesting.
One thing SWISH-E does that is absolutely essential to my applications
is index the contents of META tags, so that a search can be limited to
what is in a specific META tag. Does SWISH++ do that? Thanks,
Roy
On Fri, 27 Feb 1998, Paul J. Lucas wrote:
> On Mon, 23 Feb 1998, Jacques Delsemme wrote:
>
> > 1- When indexing, swish-e goes very fast at first, then slows down more
> > and more, until it literally crawls when you have a lot of data to
> > index. It would be great to have it report periodically on its use
> > of system resources to be able to learn where the bottleneck is located.
>
> It's eating up memory and your machine is swapping. Also, from
> looking at the code, it appears to use unbalanced binary trees
> for the words (although it uses hashes for most everything
> else).
>
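A rough illustration of how an unbalanced binary tree can degrade. This is a sketch only: it is not taken from the SWISH-E source, and the names Node, insert and max_depth_after are made up for the example. If words happen to arrive in sorted order, the tree turns into a linked list and each insert walks all of it, while a random arrival order keeps the depth roughly logarithmic.

```python
import random

class Node:
    """One word in a plain (never rebalanced) binary search tree."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key iteratively; return (root, depth at which key now sits)."""
    if root is None:
        return Node(key), 1
    node, depth = root, 1
    while True:
        if key == node.key:
            return root, depth
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root, depth + 1
        node, depth = child, depth + 1

def max_depth_after(keys):
    """Build a tree from keys and report the deepest level reached."""
    root, deepest = None, 0
    for key in keys:
        root, depth = insert(root, key)
        deepest = max(deepest, depth)
    return deepest

words = [f"w{i:05d}" for i in range(2000)]   # already in sorted order
sorted_depth = max_depth_after(words)        # degenerates into a list

shuffled = words[:]
random.Random(0).shuffle(shuffled)
shuffled_depth = max_depth_after(shuffled)   # stays shallow
```

With sorted input the depth equals the number of words, so the cost of each insert grows with the size of the index.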
> Because of this and many other limitations of SWISH-E, I've
> written SWISH++. I'm putting the finishing touches on the docs
> now, so it should be available next week sometime. Briefly
> (from the in-progress README):
>
> 1. 8-10 times faster at indexing. It achieves this speed by using:
> a) mmap(2) instead of stdio to read files
> b) very little explicit dynamic memory allocation
> c) more inlining and fewer function calls in inner loops
> d) better data structures and algorithms by virtue of
> using STL (The C++ Standard Template Library), e.g.,
> maps rather than linked lists
>
> 2. Better results format of:
>
> rank path_name file_size file_title
>
> By placing the file_title, which may contain spaces, last,
> you can easily parse it, e.g.:
>
> ($rank,$path,$size,$title) = split( / /, $_, 4 );
>
> ---> 3. Automatically splits and remerges large file sets.
>
> 4. Parses hexadecimal numeric character entity references of
> the form "&#xhhh;" in addition to decimal ones.
>
> 5. Searches are practically instantaneous because the index
> file is mmap(2)'ed and binary-searchable immediately.
>
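On item 4: in HTML, a hexadecimal numeric character reference is written "&#xhhh;" and a decimal one "&#ddd;". Below is a minimal sketch of decoding both forms. It is an illustration, not the SWISH++ parser; the function name decode_numeric_refs is hypothetical, and named entities such as "&amp;" are deliberately left alone.

```python
import re

# Matches &#xHHH; (hexadecimal) or &#DDD; (decimal).
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: keep the reference as written.
            return match.group(0)
    return _NUMERIC_REF.sub(repl, text)

decoded = decode_numeric_refs("caf&#xE9; and caf&#233;")
```

An indexer that decodes both forms sees the same word whichever way the page author wrote it.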
> For example, on a SPARC Ultra 2, it indexes 5 million words (1
> million unique) in just under 8 minutes. Smokin'!
>
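On item 5: a guess at what "mmap(2)'ed and binary-searchable immediately" can look like in practice. The on-disk layout here (sorted, fixed-width, NUL-padded records of RECORD bytes) is purely an assumption made for the sketch, as are the names build_index and contains; the real SWISH++ index format is not described in the message.

```python
import mmap
import os
import tempfile

RECORD = 16  # hypothetical fixed record width in bytes (an assumption)

def _pad(word):
    """Encode a word as one fixed-width, NUL-padded record."""
    raw = word.encode("ascii")
    if len(raw) > RECORD:
        raise ValueError("word longer than the fixed record width")
    return raw.ljust(RECORD, b"\0")

def build_index(path, words):
    """Write the unique words to path as sorted fixed-width records."""
    with open(path, "wb") as f:
        for record in sorted({_pad(w) for w in words}):
            f.write(record)

def contains(path, word):
    """Binary-search the memory-mapped file; nothing is parsed up front."""
    key = _pad(word)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as index:
            lo, hi = 0, len(index) // RECORD
            while lo < hi:
                mid = (lo + hi) // 2
                record = index[mid * RECORD:(mid + 1) * RECORD]
                if record == key:
                    return True
                if record < key:
                    lo = mid + 1
                else:
                    hi = mid
    return False

index_path = os.path.join(tempfile.mkdtemp(), "words.idx")
build_index(index_path, ["swish", "index", "meta", "search", "index"])
found = contains(index_path, "meta")
missing = contains(index_path, "title")
```

Because the records are already sorted on disk, a lookup touches only about log2(n) records, and the operating system pages in just those parts of the file.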
> - Paul J. Lucas
> NASA Ames Research Center Caelum Research Corporation
> Moffett Field, California San Jose, California
> <pjl AT ptolemy DOT arc DOT nasa DOT gov>
>
>
Received on Fri Feb 27 11:08:42 1998