
Re: Storing Descriptions from both META and BODY

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 04 2002 - 17:10:12 GMT
Sending back to the list -- as others may be in a similar situation or have
a better idea.

At 12:29 PM 10/04/02 -0400, Jeffrey.Grunstein@ny.frb.org wrote:
>
>Here's another problem we're having.
>
>We're storing <BODY> text for HTML files (the first 2500 characters) and
>the first 2500 characters of PDFs.
>
>2500 characters isn't enough for larger HTML and PDF files.  If a search
>term is found after the 2500th character, it won't show up in the search
>results and won't be highlighted.  (Some of our PDFs are huge.)
>
>So we cranked those numbers up to 250,000, with the intention of storing
>the full documents.
>Sure enough, it worked, but performance also slowed to a crawl.

Try turning off the highlighting code first.  I posted a day ago explaining
how the phrase highlighting module works; it's very processing intensive.
If speed is reasonable with highlighting off, then try the other
highlighting modules and see if one of them is fast enough.  Maybe
SimpleHighlight.pm will work well enough for you (it won't highlight
phrases, IIRC).
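To show why word-by-word highlighting is so much cheaper than phrase
highlighting, here's a minimal sketch of the idea in Python (just for
illustration; the function name and tags are mine, not SWISH-E's, and
SimpleHighlight.pm's actual logic may differ):

```python
import re

def simple_highlight(text, terms, open_tag="<b>", close_tag="</b>"):
    """Wrap each occurrence of any single search term in highlight tags.

    Like SimpleHighlight.pm, this matches individual words only and makes
    no attempt to match phrases -- that's what keeps it cheap: one regex
    pass over the text instead of comparing word positions.
    """
    # One alternation of all the terms, longest first, whole words only.
    alternation = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, text)
```

Searching for "swish" in a stored description would then mark every
occurrence of that one word, regardless of the words around it.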

You might see a big difference if you ran on a machine with a very fast
CPU.  I'm amazed at how fast my inexpensive Athlon runs.  Can you test on a
different machine?

If your PDF files are huge, then another option is to split them up and
index them in chunks.  That has the advantage that your results will be
targeted to specific sections: if "foo" is only found in one paragraph of
the document, the highlighter only has to search through a small amount of
text to find the words to highlight.  The downside is that you can get
multiple hits for the same file, which may seem confusing, especially since
the links would all just go back to the one PDF file.  It works better with
HTML, where you can link to sections with <a name> tags.  Try searching
http://perl.apache.org for an example of HTML indexed in sections.
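The chunking itself is simple.  A sketch in Python (names and the overlap
idea are mine; you'd do the equivalent in your indexing script, feeding
each chunk to SWISH-E as its own document):

```python
def split_into_chunks(text, chunk_size=2500, overlap=200):
    """Split a long document's text into overlapping chunks for indexing.

    Each chunk gets indexed as its own 'document', so highlighting a hit
    only has to scan one chunk's worth of text instead of the whole file.
    The overlap keeps phrases that straddle a chunk boundary findable.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # must be positive, or the loop never ends
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

With HTML you'd also record an <a name> anchor per chunk so result links
can jump to the right section; with PDF the links can only point at the
whole file.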

>We're trying to get it to work under mod_perl, but haven't been able to
>yet.  Would mod_perl alone
>make enough of a difference that speed would be acceptable even at 250,000?

No.  mod_perl makes the CGI script run much, much faster (and thus able to
handle more requests per second), and it allows you to make *searching*
faster, but it won't help with highlighting results.

For a bit of work, you could even pre-process the descriptions into the
structures that the highlighting code uses and cache those on disk.  That
might help a little, but when it comes down to it, highlighting is just
slow work.
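The cache-on-disk idea looks roughly like this.  This is a Python sketch
of the concept only: the structures SWISH-E's highlighting modules
actually build are their own, and every name here is mine.  The point is
just that tokenizing each description once at index time, instead of on
every search, trades disk space for CPU:

```python
import os
import pickle
import re

def cached_tokens(doc_id, text, cache_dir="desc_cache"):
    """Return a tokenized description, caching the result on disk.

    Splitting the description into (word, offset) pairs is work that
    otherwise repeats on every search that hits this document.  Here we
    do it once and pickle the result, keyed by document id.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "%s.tok" % doc_id)
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    # The cached structure: (lowercased word, start offset) pairs, which
    # is enough to locate and wrap each match in the original text.
    tokens = [(m.group(0).lower(), m.start())
              for m in re.finditer(r"\w+", text)]
    with open(path, "wb") as fh:
        pickle.dump(tokens, fh)
    return tokens
```

The highlighting pass then works from the cached pairs rather than
re-splitting the stored description each time.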

Please post back what you find works best for you, ok?

Optimizations welcome! 


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Oct 4 17:14:22 2002