Antw: [SWISH-E:419] Re: indexing PDF

From: Rainer Scherg RTC <Rainer.Scherg(at)>
Date: Tue Aug 11 1998 - 08:25:00 GMT
Patrick Fitzgerald wrote:


> >
> >I've made some enhancements to swish-e 1.1 to index Non-Text or HTML
> >files (e.g. to get PDF-files indexed) [I've sent the code changes to
> > Roy].
> Could you describe the code changes?

swish-e starts a filter program as a child process, and the output of the
filter prog is piped to the swish-e process. This was a minor change to the
"Index the current file" routine of swish-e. Building in the config file
directives was more work.
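
A minimal shell sketch of the piping idea described above, not the actual C change in swish-e. The function name index_with_filter is hypothetical, and the "indexer" is only a stand-in that counts the words it reads from the pipe:

```shell
#!/bin/sh
# Sketch only: index_with_filter is a hypothetical name, not part of swish-e.
# $1 = filter program, $2 = file to index.
index_with_filter() {
    filter=$1
    file=$2
    # The filter runs as a child process; its STDOUT feeds the pipe.
    # The awk command stands in for the indexer and counts the words it reads.
    "$filter" "$file" | awk '{ n += NF } END { print n + 0 }'
}
```

For a file that is already plain text, cat can serve as the filter: index_with_filter cat somefile.txt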

>  Do you directly index the PDF files?

Yes. I've implemented a FileFilter option, which enables you to include
filters for any file type.
E.g. for PDF, the entry in the config file:

     FileFilter  .pdf

The filter prog is very simple:

pdftotext "$1" - 2>/dev/null

... using the xpdf utility  (pdftotext).
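
A slightly more defensive variant of the one-liner might look like the sketch below. The function name pdf_filter and the exit code 2 are assumptions for illustration. The only external call is pdftotext from xpdf, where "-" as the output name means STDOUT:

```shell
#!/bin/sh
# Sketch only: pdf_filter and the exit code 2 are hypothetical choices.
pdf_filter() {
    # Refuse to run without exactly one readable input file.
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: pdf_filter FILE.pdf" >&2
        return 2
    fi
    # Write the extracted text to STDOUT and discard pdftotext's warnings.
    pdftotext "$1" - 2>/dev/null
}

# When run as a script with an argument, act as the filter prog itself.
if [ "$#" -gt 0 ]; then
    pdf_filter "$@"
fi
```

Returning a non-zero status for a missing file lets the caller tell "no text found" apart from "filter could not run".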

> To index PDF files, I implemented the following workaround:
> 1. For every PDF file (for example, "myfile.pdf"), create a file
> "myfile.pdf.html" that contains the plain text to be indexed.
> [...]

That's too complicated for me to handle in practice. ;-)
The filter progs have to convert the contents of a file (pdf, word, xls)
to standard text and print it on STDOUT.

cu -- Rainer
Received on Tue Aug 11 01:36:42 1998