On Mon, Jul 28, 2003 at 02:19:54PM -0700, Klingensmith, Rick wrote:
> I'm continuing to have a problem with filters. I'm in a windows 2000/XP
> environment and am using the spider to crawl my site which contains pdf
> files. Pdfinfo and pdftotext are installed and working from the command
> line.
That's a good thing to know.
> For each pdf file indexed I receive the following error:
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>
> Error: Couldn't find trailer dictionary
>
> Error: Couldn't read xref table
Those are all messages coming from xpdf. So the next step is to modify
whatever is calling pdfinfo/pdftotext and see how it's being called.
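One way to check what the calling code actually receives is to run the helper program yourself and look at the raw bytes instead of the rendered text. The following is a minimal Python sketch, not part of swishspider (which is Perl); the function name raw_lines is made up for illustration.

```python
import subprocess

def raw_lines(cmd):
    """Run cmd and return each output line as a repr() string,
    so hidden characters such as '\\r' become visible."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # keepends=True preserves the original line terminators for inspection.
    return [repr(line) for line in result.stdout.splitlines(keepends=True)]
```

Calling raw_lines(["pdfinfo", "test.pdf"]) (hypothetical file name, and assuming pdfinfo is on the PATH) and printing each entry would show whether the lines end in \r\n or just \n.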
> I modified swishspider at line 144 to print the contents to stderr and
> receive the following output for the meta tags for the document. As you can
> see below I believe the meta tags from the output from pdfinfo are not being
> formed properly. I just can't figure out why.
> <html>
>
> <head>
>
> ">eta name="author" content="jamin
>
> ">eta name="creationdate" content="04/23/03 10:40:15
>
> ">eta name="creator" content="Affidavit final.doc - Microsoft Word
>
> ">eta name="encrypted" content="no
>
> ">eta name="file_size" content="31838 bytes
>
> ">eta name="moddate" content="04/23/03 10:47:36
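That's weird output. It looks like the start of each line is being
dropped and there's an extra blank line. Maybe DOS line endings are
causing the problem? A stray carriage return at the end of each value
would send the closing "> back to the start of the line, where it would
overwrite the leading <m.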
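To illustrate the DOS line ending guess, here is a small Python sketch of a parser that turns "Key: value" lines into meta tags. It is an assumption about the failure mode, not the actual swishspider or SWISH::Filter code; the function name meta_tags, the strip_cr switch, and the sample text are invented for the example.

```python
def meta_tags(pdfinfo_text, strip_cr):
    """Turn 'Key: value' lines into <meta> tags.

    With strip_cr=False only '\\n' is treated as the line terminator,
    mimicking a parser that assumes Unix line endings.
    With strip_cr=True a trailing '\\r' is removed as well.
    """
    tags = []
    for line in pdfinfo_text.split("\n"):
        if strip_cr:
            line = line.rstrip("\r")
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, since values may contain colons.
        key, value = line.split(":", 1)
        name = key.strip(" ").lower().replace(" ", "_")
        tags.append('<meta name="%s" content="%s">' % (name, value.lstrip(" ")))
    return tags

# Sample text with DOS (CRLF) line endings, made up for illustration.
sample = "Author:         jamin\r\nEncrypted:      no\r\n"
broken = meta_tags(sample, strip_cr=False)
fixed = meta_tags(sample, strip_cr=True)
print(repr(broken[0]))
print(repr(fixed[0]))
```

In the broken case the carriage return stays inside the content value, just before the closing quote. A terminal that treats \r as "return to column 0" then draws the closing "> over the first two characters of the tag, which would match the ">eta name=... lines quoted above.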
Hum, ok so you are using -S http with swishspider. Are you using the
SWISH::Filter module(s) to decode the pdf? Or are you using a
FileFilter directive (although I'm not sure that works)?
If you are using the SWISH::Filter setup, here is what I did: I added a
"use lib" line to the swishspider file so it can find the modules, then
ran:
moseley@bumby:~/apache$ ./swishspider swish http://localhost/apache/test.pdf
moseley@bumby:~/apache$ head swish.contents
<html>
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
Can you duplicate that under Windows?
--
Bill Moseley
moseley@hank.org
Received on Mon Jul 28 21:57:52 2003