More Trouble with Filters

From: Klingensmith, Rick <klingensmith(at)>
Date: Mon Jul 28 2003 - 21:20:08 GMT
I'm continuing to have a problem with filters. I'm in a Windows 2000/XP
environment and am using the spider to crawl my site, which contains PDF
files. Both pdfinfo and pdftotext are installed and work from the command
line.


For each pdf file indexed I receive the following error:


Returned 0

 - Using DEFAULT (HTML2) parser - Error: May not be a PDF file (continuing

Error (0): PDF file is damaged - attempting to reconstruct xref table...

Error: Couldn't find trailer dictionary

Error: Couldn't read xref table

 (no words indexed)



I modified swishspider at line 144 to print the contents to stderr, and I
receive the following output for the document's meta tags. As you can see
below, the meta tags built from the pdfinfo output do not appear to be
formed properly, and I can't figure out why.


- Using DEFAULT (HTML2) parser -  (23 words)

retrieving http://localhost/affidavit.pdf (1)...

spider 2376 [C:/Inetpub/Indexes/Temp/swishspider@3084




">eta name="author" content="jamin

">eta name="creationdate" content="04/23/03 10:40:15

">eta name="creator" content="Affidavit final.doc - Microsoft Word

">eta name="encrypted" content="no

">eta name="file_size" content="31838 bytes

">eta name="moddate" content="04/23/03 10:47:36

">eta name="optimized" content="yes

">eta name="page_size" content="612 x 792 pts (letter)

">eta name="pages" content="1

">eta name="pdf_version" content="1.4

">eta name="producer" content="Acrobat PDFWriter 5.0 for Windows NT

">eta name="tagged" content="no

">eta name="title" content="Affidavit final.doc




The contents of the document appear to be OK from what I can see. 
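
One possible reading of the output above, offered only as an assumption: if
each pdfinfo line ends in a Windows CRLF and only the LF is removed, a stray
carriage return stays at the end of every value, and a terminal then prints
the closing "> over the start of the line. The Python sketch below is not the
swishspider or filter code; pdfinfo_to_meta and render_on_terminal are
hypothetical names used purely to illustrate the effect and a possible fix.

```python
import html


def pdfinfo_to_meta(pdfinfo_output, strip_cr=True):
    """Convert pdfinfo-style 'Key: value' lines into <meta> tags.

    Hypothetical helper for illustration; not the swishspider filter code.
    """
    tags = []
    for line in pdfinfo_output.split("\n"):
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key: value' shape
        name = key.strip().lower().replace(" ", "_")
        value = value.lstrip(" ")
        if strip_cr:
            # Assumption: on Windows each line ends in CRLF, so a stray
            # carriage return is left on the value after splitting on LF.
            value = value.rstrip("\r")
        tags.append('<meta name="%s" content="%s">' % (name, html.escape(value)))
    return tags


def render_on_terminal(text):
    """Simulate how a terminal shows one line containing carriage returns."""
    shown = []
    col = 0
    for ch in text:
        if ch == "\r":
            col = 0  # cursor jumps back to the start of the line
        elif col < len(shown):
            shown[col] = ch  # later characters overwrite earlier ones
            col += 1
        else:
            shown.append(ch)
            col += 1
    return "".join(shown)


sample = "Author:         jamin\r\nPages:          1\r\n"

broken = pdfinfo_to_meta(sample, strip_cr=False)
fixed = pdfinfo_to_meta(sample)

print(render_on_terminal(broken[0]))  # ">eta name="author" content="jamin
print(fixed[0])                       # <meta name="author" content="jamin">
```

If this guess is right, the first printed line reproduces the garbled form
seen in my stderr output, and stripping the carriage return before building
the tag yields a well-formed meta tag.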


Have I missed something obvious, or do you need me to post the configuration
files as well?




Richard Klingensmith

MSU Human Resources Information Systems

1407 S. Harrison Road Ste. 40

East Lansing, MI 48823

(517) 432-4636 ext. 155


Received on Mon Jul 28 21:20:20 2003