Rainer Scherg RTC wrote:
>> Could you describe the code changes?
>
>It starts a filter program as a child process. The output of the filter prog
>is piped to the swish-e process. [...]
>
>The _pdf_filter.sh - prog is very simple:
>
>#!/bin/sh
>pdftotext "$1" - 2>/dev/null
>
>... using the xpdf utility (pdftotext).

Thanks for the pointer; I didn't know such a beast existed.

>> To index PDF files, I implemented the following workaround:
>>
>> 1. For every PDF file (for example, "myfile.pdf"), create a file
>> "myfile.pdf.html" that contains the plain text to be indexed.
>> [...]
>
>That's too complicated for me to handle in practice. ;-)
>The filter progs have to convert the contents of a file (pdf, word, xls)
>to standard text and print it on STDOUT.

I have a lot of large PDF files to index, and pdftotext seems to be a bit
slow on them. I would hate to waste processor time converting every PDF to
text each time I update my search index.

So I created a script that searches my directories for PDF files and
extracts the text of each one into a .pdf.txt file, but only if the .pdf.txt
file does not exist or is older than the .pdf file. That way the text is
extracted once per change to a PDF, instead of every time I create the
search index.
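
For illustration, here is a minimal sketch of that kind of script. It is not
the actual script described above. The function name update_pdf_text and the
PDFTOTEXT override variable are made up for this example; the only real tool
assumed is xpdf's pdftotext, called the same way as in the filter quoted
earlier. File names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: refresh foo.pdf.txt only when it is missing or older than foo.pdf.
# update_pdf_text and PDFTOTEXT are hypothetical names for this example;
# PDFTOTEXT defaults to xpdf's pdftotext.

update_pdf_text() {
    root=${1:-.}
    find "$root" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="$pdf.txt"
        # find -newer prints the PDF only if it was modified after the text file.
        if [ ! -f "$txt" ] || [ -n "$(find "$pdf" -newer "$txt")" ]; then
            # Write to a temporary file so a failed run leaves no partial output.
            if "${PDFTOTEXT:-pdftotext}" "$pdf" - > "$txt.tmp" 2>/dev/null; then
                mv "$txt.tmp" "$txt"
            else
                rm -f "$txt.tmp"
            fi
        fi
    done
}

# Example call (the directory is an assumption):
# update_pdf_text /path/to/docs
```

Running it before each indexing pass means only new or changed PDFs get
converted; the indexer then reads the .pdf.txt files instead of the PDFs.
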
--
Patrick Fitzgerald, HP Internet and System Security Lab
http://issl.atl.hp.com/lab/employees/fitz/
fitz@issl.atl.hp.com -or- patrick_fitzgerald@hp.com
(do *not* use pat_fitzgerald@hp.com, that is not me)
Received on Tue Aug 11 16:39:57 1998