Problem Indexing PDFs with

From: <Jeffrey.Grunstein(at)>
Date: Wed Dec 04 2002 - 15:09:16 GMT
I'm running Swish-E 2.2.1 on a Solaris 9 box.  I got a filesystem index
working flawlessly, with PDFs being parsed as TXT using pdftotext.

Now, I'm trying to get it working using the prog method and  The
crawl seems to work fine and HTML files get indexed using the HTML2
parser.  I cannot get PDF files to index correctly.  When I tried the pdf
function internal to, the PDF files were parsed as HTML2s and
between 5 and 8 words per file were indexed.  I know this is wrong because
the same PDF files with the filesystem index yield many more indexed words.

I also tried using pdftotext and that doesn't index any words.  Here's a
snippet from my swish-e config file.

IndexContents HTML2 .html .htm
StoreDescription HTML2 100000

FilterDir /opt/sfw/bin
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT .pdf
StoreDescription TXT 250000

Note that the same directives work perfectly when we do a filesystem index.
For some reason, they don't work with a prog / crawl.
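
One way to narrow this down is to look at the command the FileFilter line
should produce and run it by hand on a single PDF. The Python sketch below
is only an illustration: it assumes %p is replaced by the path of the file
handed to the filter and that the program lives under FilterDir. The helper
names (expand_filter, count_words) and the sample path are hypothetical and
are not part of Swish-E.

```python
import shlex


def expand_filter(filter_dir, program, options, path):
    """Build the argv that the FileFilter line is assumed to expand to."""
    # Assumption: %p stands for the path of the file passed to the filter.
    expanded = options.replace("%p", path)
    # shlex.split honours the single quotes around the path, as a shell would.
    return [f"{filter_dir}/{program}"] + shlex.split(expanded)


def count_words(text):
    """Count whitespace-separated words in the filter's output."""
    return len(text.split())


# Values taken from the config snippet; the PDF path is a made-up example.
argv = expand_filter("/opt/sfw/bin", "pdftotext", "'%p' -", "/tmp/sample file.pdf")
print(argv)
```

Running the resulting command directly and passing its output to
count_words would show whether pdftotext itself returns the full text for a
given file, which separates a filter problem from a parser problem.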
Received on Wed Dec 4 15:09:28 2002