Re: [SWISH-E:155] Re: Two minor suggestions

From: Roy Tennant <rtennant(at)not-real.library.berkeley.EDU>
Date: Fri Feb 27 1998 - 19:01:07 GMT

Paul,
Without understanding the internals, your version sounds interesting, but
one of the things that SWISH-E does that is absolutely essential to my
applications is that it indexes the contents of META tags such that
searches can be limited to what is in a specific META tag. Does SWISH++ do
that? Thanks,
Roy

On Fri, 27 Feb 1998, Paul J. Lucas wrote:

> On Mon, 23 Feb 1998, Jacques Delsemme wrote:
> 
> > 1- When indexing, swish-e goes very fast at first, then slows down more 
> > and more, until it literally crawls when you have a lot of data to 
> > index.  It would be great to have it report periodically on its use 
> > of system resources to be able to learn where the bottleneck is located.
> 
> 	It's eating up memory and your machine is swapping.  Also, from
> 	looking at the code, it appears to use unbalanced binary trees
> 	for the words (although it uses hashes for most everything
> 	else).
> 
> 	Because of this and many other limitations of SWISH-E, I've
> 	written SWISH++.  I'm putting the finishing touches on the docs
> 	now, so it should be available next week sometime.  Briefly
> 	(from the in-progress README):
> 
> 	1. 8-10 times faster at indexing.  It achieves this speed by using:
> 		a) mmap(2) instead of stdio to read files
> 		b) very little explicit dynamic memory allocation
> 		c) more inlining and fewer function calls in inner loops
> 		d) better data structures and algorithms by virtue of
> 		   using STL (The C++ Standard Template Library), e.g.,
> 		   maps rather than linked lists
> 
> 	2. Better results format of:
> 
> 		rank path_name file_size file_title
> 
> 	   By placing the file_title, which may contain spaces, last,
> 	   you can easily parse it, e.g.:
> 
> 		($rank,$path,$size,$title) = split( / /, $_, 4 );
> 
> --->	3. Automatically splits and remerges large file sets.
> 
> 	4. Parses hexadecimal numeric character entity references of
> 	   the form "&#xhhh;" in addition to decimal ones.
> 
> 	5. Searches are practically instantaneous because the index
> 	   file is mmap(2)'ed and binary-searchable immediately.
> 
> 	For example, on a SPARC Ultra 2, it indexes 5 million words (1
> 	million unique) in just under 8 minutes.  Smokin'!
> 
> 	- Paul J. Lucas
> 	  NASA Ames Research Center		Caelum Research Corporation
> 	  Moffett Field, California		San Jose, California
> 	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
> 
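For readers scripting against the results format in item 2 above, here is a
minimal Python sketch of the same idea as the Perl split. The Result type,
the function name, and the sample line are hypothetical. It assumes fields
are separated by single spaces and that rank and file_size are integers;
the message itself does not say either way.

```python
from typing import NamedTuple


class Result(NamedTuple):
    # Field names follow the "rank path_name file_size file_title" layout.
    rank: int
    path: str
    size: int
    title: str


def parse_result_line(line: str) -> Result:
    # Split on single spaces at most three times, so everything after the
    # third space (the title, which may contain spaces) stays in one piece.
    rank, path, size, title = line.rstrip("\n").split(" ", 3)
    # Assumption: rank and size are integers.
    return Result(int(rank), path, int(size), title)


if __name__ == "__main__":
    # Hypothetical sample line, not real SWISH++ output.
    print(parse_result_line("87 docs/intro.html 2048 Intro to SWISH++"))
```

Because the title comes last, a path_name containing a space would still
break this split, just as it would break the Perl version.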
> 
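Item 5 above attributes the search speed to an index file that is
mmap(2)'ed and binary-searchable. The Python sketch below illustrates that
general technique only. It is not SWISH++'s index format, which the message
does not describe: the fixed-width NUL-padded records, RECORD_SIZE,
build_index, and contains are all assumptions made for the illustration.

```python
import mmap

RECORD_SIZE = 16  # hypothetical fixed record width, in bytes


def _record(word: str) -> bytes:
    # Encode a word as one fixed-width record, padded with NUL bytes.
    raw = word.encode("ascii")
    if len(raw) > RECORD_SIZE:
        raise ValueError("word longer than RECORD_SIZE")
    return raw.ljust(RECORD_SIZE, b"\0")


def build_index(path: str, words) -> None:
    # Write the unique words as sorted fixed-width records.
    records = sorted({_record(w) for w in words})
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)


def contains(path: str, word: str) -> bool:
    # Map the file read-only and binary-search it in place; nothing is
    # parsed or copied into a separate structure before searching.
    # Note: mapping an empty file raises ValueError, so the index must
    # hold at least one record.
    try:
        key = _record(word)
    except ValueError:
        return False  # too long to be in the index
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lo, hi = 0, len(m) // RECORD_SIZE
            while lo < hi:
                mid = (lo + hi) // 2
                rec = m[mid * RECORD_SIZE:(mid + 1) * RECORD_SIZE]
                if rec < key:
                    lo = mid + 1
                elif rec > key:
                    hi = mid
                else:
                    return True
    return False
```

Each lookup touches only about log2(n) records, and the operating system
pages in just the parts of the file that are actually read.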
Received on Fri Feb 27 11:08:42 1998