Re: [SWISH-E:419] Re: indexing PDF

From: Rick Beebe <BEEBE(at)not-real.BIOMED.MED.YALE.EDU>
Date: Mon Aug 10 1998 - 19:03:34 GMT
>Rainer Scherg RTC wrote:
>>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

>Could you describe the code changes?  Do you directly index the PDF files?

>To index PDF files, I implemented the following workaround:

>1. For every PDF file (for example, "myfile.pdf"), create a file
>"myfile.pdf.html" that contains the plain text to be indexed.

>2. When the search engine returns a hit on a myfile.pdf.html, change the
>reference to myfile.pdf.

>This works for other filetypes, such as Word files, etc.  The only
>disadvantage is that you must create the separate HTML files.
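
Step 2 of the quoted workaround might be sketched like this in a search front end. This is only an illustration: the function name `original_path` and the assumption that hits arrive as plain path strings are hypothetical and not part of swish-e.

```python
def original_path(hit_path, suffix=".html"):
    """Map a hit on a companion text file back to the PDF it stands in for."""
    # Only strip the suffix when what remains is still a .pdf name, so
    # ordinary HTML pages in the same index pass through unchanged.
    if hit_path.endswith(".pdf" + suffix):
        return hit_path[: -len(suffix)]
    return hit_path


print(original_path("myfile.pdf.html"))  # myfile.pdf
print(original_path("index.html"))       # index.html
```

The same idea would extend to other file types by checking for their extensions as well.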

We do a similar thing. We've got several manuals in PDF format. We use
Acrobat to spit out text versions of each PDF file, which we put in a
different directory. For example:

/manual/chapter1.pdf   <-- real pdf
/manual/txt/chapter1.pdf   <-- text equivalent

Then we use a "ReplaceRules" directive in swish.conf to replace /manual/txt
with /manual in the indexed paths. Because the text file has the same name
as the PDF, we don't have to deal with any weird machinations later on, and
it works with any search interface we want to use.
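
In swish.conf the directive would look something like `ReplaceRules replace "/manual/txt" "/manual"`, though the exact syntax should be checked against the documentation for your version. The effect on each indexed path can be sketched as follows; the helper name `apply_replace_rule` is made up for illustration and is not part of swish-e.

```python
def apply_replace_rule(path, old="/manual/txt", new="/manual"):
    """Rewrite an indexed path so it points at the real PDF."""
    # Plain substring replacement of the first occurrence; this assumes
    # the rule is a simple string substitution on the stored path.
    return path.replace(old, new, 1)


print(apply_replace_rule("/manual/txt/chapter1.pdf"))  # /manual/chapter1.pdf
```

Because the rewrite happens at index time, the search results already carry the path of the real PDF and the front end needs no special handling.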


    Rick Beebe                                           (203) 785-4566
    Network Engineering Manager                     FAX: (203) 737-4037
    ITS-Med Technology Operations         
    Yale University School of Medicine                                 
    P.O. Box 208089, New Haven, CT 06520-8089
Received on Mon Aug 10 11:18:09 1998