PDF to HTML causing swish-e to crash

From: Greg Fenton <greg_fenton(at)not-real.yahoo.com>
Date: Thu Oct 10 2002 - 19:03:18 GMT

On RedHat 7.3, using the default RH xpdf-1.00-3 RPM install, I notice
that pdftotext goes to 100% CPU and spits out reams of output when run
from swish-e.

I am using _pdf2html.pl (from filter-bin).  When I run pdftotext
against all of my PDFs by hand (from bash), I have no problems.  But
when run from swish-e, I get:

    Error (65487): Bad uncompressed block length in flate stream
    Error (62305): Unexpected end of file in flate stream
    Error (52819): Bad code (7f4c) in flate stream
    Error (52819): Unexpected end of file in flate stream
    Error (55873): Wrong number (5) of args to 'c' operator
    [...]

for some files and then the following in a tight loop until I kill
swish-e:

    Error (109647): Dictionary key must be a name object
    Error (109647): Dictionary key must be a name object
    [...]

Looking at the output from pdftotext, I see that many of the
conversions contain garbage.  I suspect, upon reading the data, that it
is a font translation issue or something similar, though I have
next-to-no experience with creating PDFs.  (I didn't create the PDFs
on our site, but I believe that most were created with MS-Word
and Acrobat.)

My PDF configuration for swish-e:

    FileFilter  .pdf   /home/apache/bin/swish_pdf2html.pl "'%p'"
    IndexContents HTML2 .htm .html .shtml .pdf .doc .xls .ppt .ps

I tried another filter (pdf2html.pl from ht://Dig) but ended up with
similar results (it too uses pdftotext).
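
One way to narrow a problem like this down is to run the converter
under a time limit and keep its error output separate from the
extracted text, so a file that sends pdftotext into a loop is cut off
rather than stalling the whole run.  The following is only a minimal
sketch, assuming Python 3 and a pdftotext binary on the PATH; the
names run_converter and pdf_to_text are hypothetical and are not part
of swish-e, xpdf, or either filter script mentioned above.

```python
import subprocess
import sys


def run_converter(cmd, timeout_s=30.0):
    """Run a converter command with a time limit.

    Returns (text, error_lines, timed_out). The command is passed as a
    list, so no shell quoting is involved.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so a
        # converter stuck in a loop does not keep running.
        return "", [], True

    # latin-1 maps every byte to a character, so decoding cannot fail
    # even when the converter emits garbage.
    text = proc.stdout.decode("latin-1")
    errors = proc.stderr.decode("latin-1").splitlines()
    return text, errors, False


def pdf_to_text(pdf_path, timeout_s=30.0):
    # "pdftotext FILE -" writes the extracted text to stdout.
    return run_converter(["pdftotext", pdf_path, "-"], timeout_s)
```

Calling pdf_to_text() on each PDF in turn and printing the paths where
timed_out is true, or where error_lines is non-empty, would give a
list of the files worth inspecting by hand.  This does not explain
why the behaviour differs between bash and swish-e; it only helps
isolate which inputs trigger it.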

Anyone have better experience with pulling text out of PDFs?  I'd
really like to be able to index their contents, but right now I can't
run a successful index when I try.

Thanks in advance,
greg_fenton.

=====
Greg Fenton
greg_fenton@yahoo.com

Received on Thu Oct 10 19:07:00 2002