problem indexing PDFs - "Error (0): PDF file is damaged"

From: <Brad_Horstkotte(at)not-real.capgroup.com>
Date: Tue Dec 16 2003 - 22:20:16 GMT
(reposting to add the subject line that I forgot the first time...)

I've been trying to get PDF indexing to work and haven't had any luck.
I'm running into the same problem that was discussed in this thread (null
characters in the PDF files are replaced with line feed characters, and
the PDF is later seen as invalid):

http://swish-e.org/archive/4511.html

Has this problem been fixed?

The PDFs convert fine when I run _pdf2html.pl on the file from the command
line, but fail when converted via the spider.
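
One way to confirm the corruption described above is to compare the
original PDF with the temporary copy that the filter receives, byte for
byte. The following is only a minimal sketch in Python, not part of
SWISH-E; the function names and both file paths are hypothetical, and it
assumes you can get hold of the temporary file before it is deleted.

```python
from pathlib import Path


def count_nul_to_lf(original: bytes, converted: bytes) -> int:
    """Count offsets where a NUL byte in `original` is an LF in `converted`."""
    return sum(
        1 for a, b in zip(original, converted) if a == 0x00 and b == 0x0A
    )


def compare_files(original_path: str, converted_path: str) -> int:
    """Read both files in binary mode and count NUL -> LF substitutions."""
    original = Path(original_path).read_bytes()
    converted = Path(converted_path).read_bytes()
    if len(original) != len(converted):
        # A length change means offsets no longer line up, so the count
        # returned below is not reliable in that case.
        print(f"length differs: {len(original)} vs {len(converted)} bytes")
    return count_nul_to_lf(original, converted)
```

A non-zero count from compare_files() on the two copies would suggest the
file is being altered somewhere between the fetch and the filter, rather
than by _pdf2html.pl itself.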

I am running on Windows 2000; here is my configuration:

----------

IndexDir spider.pl
SwishProgParameters default http://L0053022/index.htm
IndexOnly .htm .html .pdf

StoreDescription HTML* <body> 10000
FuzzyIndexingMode Stemming_en2
MetaNames description keywords
PropertyNames description keywords

IndexReport 3
ParserWarnLevel 1

FilterDir /SWISH-E/lib/swish-e
FileFilter .pdf _pdf2html.pl '"%p" -'

----------

And here are the errors I get when doing the PDF conversion via the
spider:

----------

http://l0053022/pdf/shareholder/editable/afd-103_aflink.pdf - Using HTML2
parser - Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
C:\SWISH-E\lib\swish-e\_pdf2html.pl: Failed close on pipe to pdfinfo for
C:\TEMP \swtmpfltrcnaaaa: 256 at C:\SWISH-E\lib\swish-e\_pdf2html.pl line
54.
 (no words indexed)

----------
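
The errors above complain about the PDF header, the trailer dictionary,
and the xref table. As a rough triage of the temporary file, one could
check whether those markers are present at all. This is a hedged sketch in
Python, not how pdfinfo validates files: the function name and the
1024-byte windows are assumptions, and newer PDFs that use cross-reference
streams may legitimately lack the "trailer" keyword.

```python
def pdf_structure_report(data: bytes) -> dict:
    """Report which basic PDF structural markers appear in `data`."""
    head = data[:1024]   # assumed window for the "%PDF-" header
    tail = data[-1024:]  # assumed window for the end-of-file markers
    return {
        "header": b"%PDF-" in head,
        "trailer": b"trailer" in data,
        "startxref": b"startxref" in tail,
        "eof": b"%%EOF" in tail,
    }
```

A file that passes all four checks can still be damaged: a byte
substitution such as NUL to LF leaves these markers intact but corrupts
binary streams and can shift the offsets that the xref table relies on.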

I saw SWISH::Filter mentioned as an alternative, but so far I have avoided
it since I'm a Perl dolt, and it looked like less of a turnkey solution.

Thanks in advance - Brad
Received on Tue Dec 16 22:20:26 2003