RE: Document Summaries/Descriptions

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Nov 15 2000 - 15:18:36 GMT
At 06:23 AM 11/15/00 -0800, Rainer.Scherg@rexroth.de wrote:
>We have ~16000 docs and 4.5 MB of data volume, not counting databases.
>16000 docs * 200 chars ~ 3-4 MB of added index size.
>
>Our swish index would increase from 40 Megs to 45 Megs.
>Storing the description along with the title and path will
>not slow down the search process, because this information
>doesn't need another hash (or whatever).

I suppose as long as the OS is smart enough to share the index between
processes, and there's room in RAM for the index, that's not an issue.
Makes sense.  The descriptions I use are often 3K or more, though, so you
can see why the index might not be the best place for me.  Plus, my
descriptions have fields, so they would need to be split during processing
anyway.
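
To be concrete about the splitting, something like this (the pipe
delimiter and the field layout here are made up for illustration):

    # Hypothetical layout: a description stored as "author|date|summary".
    my $description = 'J. Smith|2000-11-15|Notes on indexing a large doc set';
    my ($author, $date, $summary) = split /\|/, $description, 3;
    print "Summary: $summary\n";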

>Retrieving this info by an external process (e.g. search.cgi)
>will have an impact on the server load. In our case we cannot
>provide a text extract of thousands of pdf and doc files.
>An online filter call per file to get this information will
>IMO slow down the search process to a "non-acceptable"...

I have meta-tag-style documents that I extract descriptions from, and it's
quite fast, but I don't have to convert from .doc or .pdf on the fly.  You
wouldn't want to do that -- much better to pre-process at indexing time if
you have the disk space.
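
By pre-processing I mean something along these lines, as a rough sketch
(the .desc sidecar-file convention is made up, and the regex is only a
crude stand-in for a real HTML parser):

    # Run once at indexing time: pull the meta description out of each
    # HTML file and save it beside the file, so the search script only
    # has to read a tiny .desc file instead of running a filter on the fly.
    foreach my $file (@ARGV) {
        open my $fh, '<', $file or die "Can't read $file: $!";
        my $html = do { local $/; <$fh> };   # slurp the whole document
        close $fh;
        my ($desc) =
            $html =~ /<meta\s+name="description"\s+content="([^"]*)"/i;
        next unless defined $desc;
        open my $out, '>', "$file.desc" or die "Can't write $file.desc: $!";
        print $out $desc;
        close $out;
    }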

I'm not sure what "fast" means, but on this little Linux machine,
benchmarking with ApacheBench (ab), I can get almost ten requests per
second using mod_perl and the library version of SWISH while extracting
about 20 descriptions per page (per request) from the source files.

That drops to less than one per second if using CGI instead of mod_perl.
CGI is not a good choice for busy sites or where you want snappy response
time, of course.

Now, if I remember correctly, running just "hello world" under mod_perl I
get about 800 requests/second.  And running my swish program with the
swish call itself skipped, but still reading a bunch of template files and
other source documents, I get about 100/second.  So most of that slowness
(if you can call 10/second slow!) seems to be swish.
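
For reference, the "hello world" here is just the stock mod_perl (1.x)
content handler, more or less:

    package Apache::Hello;
    use strict;
    use Apache::Constants qw(OK);

    sub handler {
        my $r = shift;                  # the Apache request object
        $r->content_type('text/plain');
        $r->send_http_header;
        $r->print("Hello world\n");
        return OK;
    }

    1;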

The other fast option is to use a database to store descriptions that have
been pre-extracted from the source documents.  This is more flexible than
storing them on disk, but it's unclear if it's any faster than simply
using the file system.
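
A sketch of the database approach with DBI (the DSN and the table of
(path, description) rows are hypothetical; the key is the file path that
swish returns for each hit):

    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:swishdocs', 'user', 'password',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare(
        'SELECT description FROM descriptions WHERE path = ?');

    # Look up the pre-extracted description for one result path.
    sub description_for {
        my ($path) = @_;
        $sth->execute($path);
        my ($desc) = $sth->fetchrow_array;
        return $desc;
    }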

Don't get me wrong.  I totally agree that a way to store descriptions in
the index is a great idea.  I was just wondering if Properties might work,
and commenting that storing them outside the index might be more flexible
in general.
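
To make "outside the index" concrete, here's a rough sketch that pipes a
query through the swish-e binary and then picks the description up from
the sidecar files above (the index file name is an assumption, and the
parsing assumes result lines of the usual form: rank, path, quoted title,
size):

    use strict;

    my $query = shift @ARGV;
    die "usage: $0 query\n" unless defined $query;

    open my $swish, '-|', 'swish-e', '-f', 'index.swish-e', '-w', $query
        or die "Can't run swish-e: $!";

    while (my $line = <$swish>) {
        next unless $line =~ /^\d/;     # skip the "# ..." header lines
        # Assumes no spaces in the indexed file paths.
        my ($rank, $path, $title) = $line =~ /^(\d+)\s+(\S+)\s+"([^"]*)"/;
        next unless defined $path;
        my $desc = '';
        if (open my $d, '<', "$path.desc") {
            $desc = do { local $/; <$d> };
            close $d;
        }
        print "$rank $title\n$desc\n\n";
    }
    close $swish;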

SWISH should stay focused on speed.  We don't want to have to rename it SNAIL.


Bill Moseley
mailto:moseley@hank.org