Re: problem indexing PDFs - "Error (0): PDF file is damaged"

From: Suzanne Smolowitz <suzanne(at)not-real.slpmbo.ed.ray.com>
Date: Wed Dec 17 2003 - 13:05:37 GMT

I had that problem when running pdf2html.pl on zero-length PDF files.  When it
ran into a zero-length PDF file and got the errors, the whole swish run would
quit.  I solved that problem by modifying pdf2html.pl to include some error
checking.  Are you sure you don't have any empty PDF files?  I am attaching my
modified version (pdf2html.pm).
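
One way to check for empty PDF files before indexing is a short script along
the following lines. This is only a sketch under stated assumptions:
find_empty_pdfs is a made-up helper name (it is not part of swish-e), and the
directories to scan are whatever you pass on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return the paths of all zero-length *.pdf files below the given directories.
# (find_empty_pdfs is a hypothetical helper, not part of swish-e.)
sub find_empty_pdfs {
    my @dirs = @_;
    my @empty;
    find(
        sub {
            # $_ is the current file name; -f: plain file, -z: size is zero
            push @empty, $File::Find::name
                if /\.pdf\z/i && -f $_ && -z _;
        },
        @dirs
    );
    @empty = sort @empty;
    return @empty;
}

# When run with directory arguments, list the empty PDF files found.
if (@ARGV) {
    print "$_\n" for find_empty_pdfs(@ARGV);
}
```

Running it as "perl find_empty_pdfs.pl /path/to/docs" (script name and path
are placeholders) prints one empty PDF per line, and nothing if there are none.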

Brad_Horstkotte@capgroup.com wrote:

> (reposting to add the subject line that I forgot the first time...)
>
> I've been poking around trying to figure out how to get PDF indexing to
> work, and haven't had any luck - I'm running into the same problem which
> was discussed on this thread (null characters in the PDF files being
> replaced with line feed characters, and later on the PDF is seen as
> invalid):
>
> http://swish-e.org/archive/4511.html
>
> Has this problem been fixed?
>
> The PDFs convert fine when running _pdf2html.pl from the command line on
> the file, but fail when converted via the spider.
>
> I am running on Windows 2000; here is my configuration:
>
> ----------
>
> IndexDir spider.pl
> SwishProgParameters default http://L0053022/index.htm
> IndexOnly .htm .html .pdf
>
> StoreDescription HTML* <body> 10000
> FuzzyIndexingMode Stemming_en2
> MetaNames description keywords
> PropertyNames description keywords
>
> IndexReport 3
> ParserWarnLevel 1
>
> FilterDir /SWISH-E/lib/swish-e
> FileFilter .pdf _pdf2html.pl '"%p" -'
>
> ----------
>
> ...and here are the errors I get when doing the PDF conversion via the
> spider:
>
> ----------
>
> http://l0053022/pdf/shareholder/editable/afd-103_aflink.pdf - Using HTML2
> parser - Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> C:\SWISH-E\lib\swish-e\_pdf2html.pl: Failed close on pipe to pdfinfo for
> C:\TEMP \swtmpfltrcnaaaa: 256 at C:\SWISH-E\lib\swish-e\_pdf2html.pl line
> 54.
>  (no words indexed)
>
> ----------
>
> I saw SWISH::Filter mentioned as an alternative, but so far have avoided it
> since I'm a perl dolt, and it looked like less of a turnkey alternative.
>
> Thanks in advance - Brad

--
Suzanne Smolowitz, Raytheon Systems Company
suzanne@sels.rsc.raytheon.com



Attachment: pdf2html.pm

package pdf2html;
use strict;

=pod

=head1 NAME

pdf2html - swish-e sample module to convert pdf to html

=head1 SYNOPSIS

    use pdf2html;
    my $html_record_ref = pdf2html( $pdf_file_name, 'title' );

    # or by passing content in a scalar reference
    my $html_text_ref = pdf2html( \$pdf_content, 'title' );


    

=head1 DESCRIPTION

Sample module for use with other swish-e 'prog' document source programs.

Pass either a file name, or a scalar reference.

The difference is that when you pass a reference to a scalar,
only the content is returned.  When you pass a file name,
an entire record is returned, ready to be fed to swish -- this
includes the headers required by swish for indexing.

The second optional parameter is the extracted PDF info tag to use as the HTML title.



The plan is to find a library that will do this to avoid forking an external
program.

=head1 REQUIREMENTS

Uses the xpdf package that includes the pdftotext conversion program.
This is available from http://www.foolabs.com/xpdf/xpdf.html.

You will also need the module File::Temp (and its dependencies)
available from CPAN if passing content to this module (instead of a file name).


=head1 AUTHOR

Bill Moseley

=cut

use Symbol;


use vars qw(
    @ISA
    @EXPORT
    $VERSION
);

# $Id: pdf2html.pm,v 1.5 2002/09/11 00:53:44 whmoseley Exp $
$VERSION = sprintf '%d.%02d', q$Revision: 1.5 $ =~ /: (\d+)\.(\d+)/;

require Exporter;
@ISA    = qw(Exporter);
@EXPORT = qw(pdf2html);

if ( $0 eq 'pdf2html.pm' ) {
    my $file = shift || die "Usage: perl pdf2html.pm file.pdf [title tag]\n";
    my $title = shift;
    print ${pdf2html( $file, $title )};
}


sub pdf2html {
    my $file_or_content = shift;
    my $title_tag = shift;

    my $file = ref $file_or_content
    ? create_temp_file( $file_or_content )
    : $file_or_content;

    my $metadata = get_pdf_headers( $file );

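    # Error checking for bad (e.g. zero-length) PDF files: get_pdf_headers
    # always returns a hash reference, so test for an empty hash as well.
    if ( !$metadata || !%$metadata ) {
        unlink $file if ref $file_or_content;    # remove the temp file
        my $empty = '';
        return \$empty;
    }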

    my $headers = format_metadata( $metadata );

    if ( $title_tag && exists $metadata->{ $title_tag } ) {
        my $title = escapeXML( $metadata->{ $title_tag } );

        $headers = "<title>$title</title>\n" . $headers
    }
    

    # Check for encrypted content

    my $content_ref;

    # patch provided by Martial Chartoire
    # ('.*' was written as '\.*', which only matches literal dots)
    if ( $metadata->{encrypted} && $metadata->{encrypted} =~ /yes.*\scopy:no\b/i ) {
        $content_ref = \'';

    } else {
        $content_ref = get_pdf_content_ref( $file );
    }

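    # Error checking for bad PDF files: return a reference (callers
    # dereference the result), not a plain string.
    if ( !$content_ref ) {
        unlink $file if ref $file_or_content;    # remove the temp file
        my $empty = '';
        return \$empty;
    }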

    my $txt = <<EOF;
<html>    
<head>
$headers
</head>
<body>
<pre>
$$content_ref
</pre>
</body>
</html>
EOF

    if ( ref $file_or_content ) {
        unlink $file;
        return \$txt;
    }

    my $mtime  = (stat $file )[9];

    my $size = length $txt;

    my $ret = <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $file

EOF

    $ret .= $txt;

    return \$ret;
}

sub get_pdf_headers {

    my $file = shift;
    my $sym = gensym;

    # 'or' rather than '||': '||' binds to the command string, so the warning could never fire
    open $sym, "/local/transwm/search_engine/prog_bin/pdfinfo $file |" or warn "$0: Failed to open $file $!";

    my %metadata;

    while (<$sym>) {
        if ( /^\s*([^:]+):\s+(.+)$/ ) {
            my ( $metaname, $value ) = ( lc( $1 ), $2 );
            $metaname =~ tr/ /_/;
            $metadata{$metaname} = $value;
        }
    }
    close $sym or warn "$0: Failed close on pipe to pdfinfo for $file: $?";

    return \%metadata;
}

sub format_metadata {

    my $metadata = shift;

    my $metas = join "\n", map {
        qq[<meta name="$_" content="] . escapeXML( $metadata->{$_} ) . '">';
    } sort keys %$metadata;


    return $metas;
}

sub get_pdf_content_ref {
    my $file = shift;

    my $sym = gensym;
    open $sym, "/local/transwm/search_engine/prog_bin/pdftotext $file - |" or warn "$0: failed to run pdftotext: $!";

    local $/ = undef;
    my $content = escapeXML(<$sym>);

    close $sym or warn "$0: Failed close on pipe to pdftotext for $file: $?";

    return \$content;
}



# How are URLs printed with pdftotext?
sub escapeXML {

   my $str = shift;

   for ( $str ) {
       s/</&lt;/go;
       s/>/&gt;/go;
       tr/\014/ /; # ^L
       # s/&/&amp;/go;
       # s/"/&quot;/go;
    }
   return $str;
}

# This is the portable way to do this, I suppose.
# Otherwise, just create a file in the local directory.

sub create_temp_file {
    my $scalar_ref = shift;

    require "File/Temp.pm";

    my ( $fh, $file_name ) = File::Temp::tempfile();

    print $fh $$scalar_ref or warn $!;


    close $fh or warn "Failed to close '$file_name' $!";

    return $file_name;
}
    

1;


Received on Wed Dec 17 13:05:45 2003