Re: [SWISH-E:419] Re: indexing PDF

From: Rick Beebe <BEEBE(at)not-real.BIOMED.MED.YALE.EDU>
Date: Mon Aug 10 1998 - 19:03:34 GMT
>Rainer Scherg RTC wrote:
>>
>>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

>Could you describe the code changes?  Do you directly index the PDF files?

>To index PDF files, I implemented the following workaround:

>1. For every PDF file (for example, "myfile.pdf"), create a file
>"myfile.pdf.html" that contains the plain text to be indexed.

>2. When the search engine returns a hit on a myfile.pdf.html, change the
>reference to myfile.pdf.

>This also works for other filetypes, such as Word files.  The only
>disadvantage is that you must create the separate HTML files.
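The two steps quoted above amount to a naming convention plus a reverse mapping on each hit. A minimal Python sketch of that convention, assuming the companion file is always the PDF name with ".html" appended (the function names and the SUFFIX constant are hypothetical, not part of swish-e):

```python
# Hypothetical helpers illustrating the "myfile.pdf.html" workaround.
# Nothing here is swish-e code; it only shows the name mapping.

SUFFIX = ".html"  # assumed extension added to each companion text file


def companion_name(pdf_path: str) -> str:
    """Step 1: name of the text file that gets indexed in place of the PDF."""
    return pdf_path + SUFFIX


def original_name(hit_path: str) -> str:
    """Step 2: turn a hit on 'x.pdf.html' back into a reference to 'x.pdf'."""
    if hit_path.lower().endswith(".pdf" + SUFFIX):
        return hit_path[: -len(SUFFIX)]
    # Ordinary HTML pages are left untouched.
    return hit_path
```

A search front end would apply something like original_name() to every path in the result list before printing the link.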

We do a similar thing. We've got several manuals in PDF format. We use
Acrobat to spit out a text version of each PDF file, which we put in a
different directory. For example:

/manual/chapter1.pdf   <-- real pdf
/manual/txt/chapter1.pdf   <-- text equivalent

Then we use a "ReplaceRules" directive in swish.conf to replace /manual/txt
with /manual in the indexed paths. By giving the text file the same name as
the PDF, we don't have to deal with any weird machinations later on, and it
works with any search interface we want to use.
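To make the effect of that path replacement concrete, here is a rough Python sketch of the mapping. It is not swish-e's implementation and does not show swish.conf syntax; the constant and function names are hypothetical, and it assumes the replacement only needs to apply at the start of the path:

```python
# Hypothetical illustration of the /manual/txt -> /manual mapping.
# swish-e performs this itself via the config file; this only shows
# what happens to each indexed path.

TEXT_PREFIX = "/manual/txt"  # directory holding the text equivalents
REAL_PREFIX = "/manual"      # directory holding the real PDFs


def to_real_path(indexed_path: str) -> str:
    """Map the path of an indexed text file to the PDF it stands in for."""
    if indexed_path == TEXT_PREFIX or indexed_path.startswith(TEXT_PREFIX + "/"):
        return REAL_PREFIX + indexed_path[len(TEXT_PREFIX):]
    # Paths outside the text directory are returned unchanged.
    return indexed_path
```

Because the text file keeps the PDF's file name, swapping the directory is the whole mapping, which is why no per-hit renaming is needed in the search interface.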

  _______________________________________________________________________

    Rick Beebe                                           (203) 785-4566
    Network Engineering Manager                     FAX: (203) 737-4037
    ITS-Med Technology Operations                Richard.Beebe@yale.edu   
    Yale University School of Medicine                                 
    P.O. Box 208089, New Haven, CT 06520-8089
  _______________________________________________________________________
Received on Mon Aug 10 11:18:09 1998