RE: Hit-highlighting of PDF files

From: Herman Knoops <hk.sw(at)not-real.knoman.com>
Date: Thu Jun 29 2006 - 10:18:55 GMT
> Humm... A search engine independent solution! That's an 
> interesting idea. So you actually keep a 2nd file (.LST)
> for *each* pdf you include in the index?
> Of course this forces one to double the required disk space and 
> maintain the "DB" (the .LST files), but the flexibility 
> it provides is definitely an asset.

If you have thousands of documents, a database could be an option.
Our own document sets are often only about a thousand documents
(50000 A4 pages of full text in total).

 
> Have you considered using an off the shelf database (i.e. MySQL) 
> instead of the .LST files? I'm not sure it would be a good idea,
> but the PDF files I will be indexing are huge (hundreds of MB
> each) and I am concerned with the access time of doing 2 searches
> for each user query (first in swish and then for the 
> LST lookup). Any thoughts?

I suggest a two-step approach. First the user performs a
Swish-E search, which returns a list of matching documents
(the response usually takes less than a second).

Next, when the user clicks to open a PDF file, you do the
LST lookup and return the PDF together with the generated
"pseudo-xml" file. Assuming you have:
1) PDF files saved as "web optimised" (linearised);
2) set up your browser correctly for Acrobat;
3) a server which supports "byte ranges";
then this second step is also relatively fast, since
even large PDF files are served "page by page" (in
our case often only 10 KBytes per page).
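
To illustrate conditions 1) and 3), here is a minimal Python sketch. It is
not our production setup: a normal web server such as Apache handles byte
ranges by itself. The function names (looks_linearized, parse_single_range,
partial_response) are made up for this example, the linearisation check is
only a heuristic, and only a single range per request is handled.

```python
import re
from typing import Dict, Optional, Tuple


def looks_linearized(pdf_head: bytes) -> bool:
    """Heuristic check: a linearised PDF carries a /Linearized
    dictionary near the start of the file (first 1024 bytes here)."""
    return b"/Linearized" in pdf_head[:1024]


def parse_single_range(header: str, size: int) -> Optional[Tuple[int, int]]:
    """Resolve a header such as 'bytes=0-1023' to inclusive (start, end).

    Returns None if the range cannot be satisfied for a body of `size`
    bytes; raises ValueError if the header is not a single byte range.
    """
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not m or (m.group(1) == "" and m.group(2) == ""):
        raise ValueError("not a single byte range")
    first, last = m.group(1), m.group(2)
    if first == "":
        # Suffix form 'bytes=-N': the last N bytes of the file.
        n = int(last)
        if n == 0 or size == 0:
            return None
        return max(size - n, 0), size - 1
    start = int(first)
    end = int(last) if last else size - 1
    if last and end < start:
        raise ValueError("last byte position before first")
    if start >= size:
        return None
    return start, min(end, size - 1)


def partial_response(body: bytes, header: str) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for one Range request."""
    try:
        rng = parse_single_range(header, len(body))
    except ValueError:
        # An unusable Range header is ignored: send the whole file.
        return 200, {"Accept-Ranges": "bytes"}, body
    if rng is None:
        return 416, {"Content-Range": f"bytes */{len(body)}"}, b""
    start, end = rng
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{len(body)}",
    }
    return 206, headers, body[start:end + 1]
```

With a linearised file, the Acrobat plug-in can ask for just the byte ranges
of the page it wants to show, and each request is answered with a small 206
response like the one above instead of the whole file.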

Herman Knoops
KnoMan.com
Received on Thu Jun 29 03:18:57 2006