RE: Document Summaries/Descriptions

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Wed Nov 15 2000 - 16:43:23 GMT
Hi Bill,

Mhh, a description of 3K is IMO too long to display.

What I had in mind was not an abstract or a complete description of a
document, but a short "intro scan" of the first words, like other search
engines do (like AltaVista results...).
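
To make the "intro scan" idea concrete, here is a minimal sketch in C. The
function name intro_scan and its parameters are made up for illustration and
are not part of swish-e; it only assumes that the filtered document text is
already available as a plain C string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of swish-e: copy at most max_words
 * whitespace-separated words from text into out, joined by single
 * spaces. Never writes more than out_size bytes (including the NUL).
 * Returns the number of words copied. */
size_t intro_scan(const char *text, size_t max_words,
                  char *out, size_t out_size)
{
    size_t words = 0;
    size_t len = 0;

    assert(text != NULL && out != NULL && out_size > 0);

    while (*text != '\0' && words < max_words) {
        /* Skip the whitespace in front of the next word. */
        while (*text != '\0' && isspace((unsigned char)*text))
            text++;
        if (*text == '\0')
            break;

        /* Find the end of the word. */
        const char *start = text;
        while (*text != '\0' && !isspace((unsigned char)*text))
            text++;
        size_t wlen = (size_t)(text - start);

        /* Stop if the word plus separator plus NUL would not fit. */
        size_t need = wlen + (words > 0 ? 1 : 0);
        if (len + need + 1 > out_size)
            break;

        if (words > 0)
            out[len++] = ' ';
        memcpy(out + len, start, wlen);
        len += wlen;
        words++;
    }
    out[len] = '\0';
    return words;
}
```

Cutting on word boundaries rather than at a fixed byte count keeps the
displayed snippet from ending in the middle of a word.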


Also, the pre-processing you mentioned is not an option for me; IMO it's
too complicated to handle. The only thing we would do is store the
information outside swish-e.

This would result in several additional processes:
  - the "shadow" process (to shadow the filtered data, e.g. as a txt file)
  - the enhanced search process, which checks for each result item whether
    there is a shadow file and retrieves the data...
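
A rough sketch of what the enhanced search process would have to do per
result item. Everything here is an assumption for illustration: the ".txt"
suffix convention, and the function names shadow_path and read_shadow_intro,
are hypothetical and not taken from swish-e.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed naming convention: the shadow of "doc.pdf" is "doc.pdf.txt". */
#define SHADOW_SUFFIX ".txt"

/* Build the shadow file name for a result path.
 * Returns 0 on success, -1 if out is too small. */
int shadow_path(const char *doc_path, char *out, size_t out_size)
{
    assert(doc_path != NULL && out != NULL);

    size_t n = strlen(doc_path);
    /* sizeof SHADOW_SUFFIX counts the terminating NUL as well. */
    if (n + sizeof SHADOW_SUFFIX > out_size)
        return -1;
    memcpy(out, doc_path, n);
    memcpy(out + n, SHADOW_SUFFIX, sizeof SHADOW_SUFFIX);
    return 0;
}

/* Read up to out_size - 1 bytes from the start of the shadow file.
 * Returns the number of bytes read, or -1 if there is no shadow file. */
long read_shadow_intro(const char *shadow, char *out, size_t out_size)
{
    assert(shadow != NULL && out != NULL && out_size > 0);

    FILE *fp = fopen(shadow, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(out, 1, out_size - 1, fp);
    fclose(fp);
    out[n] = '\0';
    return (long)n;
}
```

This is one extra file open and read per result item, which is exactly the
additional work the search process would take on.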


>SWISH should stay focused on speed.  We don't want to have to rename it
>SNAIL.

8-/ Yep, agreed. We can do the following:

  StoreDescription 0:
      doesn't store any description and is the default.
      The only overhead might be an empty pointer in the structure.
      This should cause no speed loss at all...

  Also, we could implement this new function behind a
  #define USE_DESCRIPTION_STORE conditional compilation path. This way
  everyone could build the swish version he wants to use.

  But IMO we should implement it and see what benchmark tests show.
  If there is a slowdown, we can add the conditional compilation code
  (or get rid of this function)...
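
  A minimal sketch of how the two variants could look. The struct and
  function names (file_entry, entry_description) are hypothetical; the real
  swish-e structures differ. Only the macro name USE_DESCRIPTION_STORE comes
  from the proposal above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file record, for illustration only. */
struct file_entry {
    const char *path;
#ifdef USE_DESCRIPTION_STORE
    /* Stays NULL when StoreDescription is 0 (the default), so the
     * only cost in that case is one unused pointer per entry. */
    const char *description;
#endif
};

/* Return the stored description, or NULL if none is stored or the
 * feature was compiled out. */
const char *entry_description(const struct file_entry *e)
{
    assert(e != NULL);
#ifdef USE_DESCRIPTION_STORE
    return e->description;
#else
    return NULL;
#endif
}
```

  Callers go through entry_description() either way, so the search code
  would not need its own #ifdef blocks.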


cu - rainer



-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Wednesday, November 15, 2000 4:18 PM
To: Multiple recipients of list
Subject: [SWISH-E] RE: Document Summaries/Descriptions

[...]
sense.  The descriptions I use are often 3K or more, so you can see why
the index might not be the best place for me.  Plus, the descriptions have
fields, so the descriptions would need to be split during processing anyway.

>Retrieving this info by an external process (e.g. search.cgi)
>will have an impact on the server load. In our case we cannot
>provide a text extract of thousands of pdf and doc files.
>An online filter call per file to get this information will
>IMO slow down the search process to a "non-acceptable"...

I have meta-tag style documents that I extract descriptions from and it's
quite fast, but I don't have to convert from .doc or .pdf on the fly.  You
wouldn't want to do that -- much better to pre-process that at indexing
time if you have the disk space.

I'm not sure what "fast" means, but on this little Linux machine I can get
almost ten requests per second using Apache Benchmark using mod_perl and
the library version of SWISH and extracting about 20 descriptions per page
(per request) from the source files.  

That drops to less than one per second if using CGI instead of mod_perl.
CGI is not a good choice for busy sites or where you want snappy response
time, of course.

Now, if I remember, running just "hello world" with mod_perl I get about
800/second.  And running my swish program without calling swish but with
accessing a bunch of template files and other source documents I can get
100/second.  So most of that slowness (if you can call 10/second slow!)
seems to be swish.
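
The arithmetic behind that conclusion can be sketched as follows, under the
simplifying assumption that requests are served one at a time, so that
throughput converts directly into time per request. The function names are
made up for illustration.

```c
#include <assert.h>

/* Convert a throughput in requests/second into milliseconds per
 * request, assuming strictly serial handling (an assumption). */
double ms_per_request(double requests_per_second)
{
    assert(requests_per_second > 0.0);
    return 1000.0 / requests_per_second;
}

/* Fraction of the per-request time attributable to swish, given the
 * rate with swish and the rate of the same program without it. */
double swish_share(double rate_with_swish, double rate_without_swish)
{
    double with = ms_per_request(rate_with_swish);
    double without = ms_per_request(rate_without_swish);
    return (with - without) / with;
}
```

With the figures above, 10/second is about 100 ms per request and
100/second is about 10 ms, so under this assumption roughly 90 ms of each
request (about 90%) is spent in swish.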

The other fast option is to use a database to store the descriptions that
have been pre-extracted from the source documents.  This is more flexible
than storing on disk, but it's unclear if it's any faster than simply using
the file system.

Don't get me wrong.  I totally agree that a way to store descriptions in
the index is a great idea.  I was just wondering if Properties might work,
and commenting that storing them outside the index might be more flexible
in general.

SWISH should stay focused on speed.  We don't want to have to rename it
SNAIL.


Bill Moseley
mailto:moseley@hank.org


Received on Wed Nov 15 16:44:54 2000