This is a multi-part message in MIME format.
--------------DE3D038C31D82805D97AA2CB
Content-Type: multipart/alternative;
boundary="------------BA9D3AAE491A2070C0D4EF52"
--------------BA9D3AAE491A2070C0D4EF52
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
I had that problem when running pdf2html.pl on zero-length PDF files. When
it hit a zero-length PDF file and got the errors, the whole swish run would
quit. I solved that problem by modifying pdf2html.pl to include some error
checking. Are you sure you don't have any empty PDF files? I am attaching my
pdf2html.pl file.
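As a rough illustration of that kind of guard (a sketch only, not the code
in the attachment), a filter could test a file before handing it to pdfinfo
or pdftotext. The helper name pdf_looks_usable is made up for this example,
and it assumes the "%PDF-" signature sits in the first five bytes of the
file, which is the usual case but is stricter than some PDF readers require.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swish-e): true only if the file exists,
# is not zero length, and begins with the "%PDF-" signature.
sub pdf_looks_usable {
    my ($file) = @_;

    # -s _ reuses the stat result from -f, so the file is stat'ed once.
    return 0 unless defined $file && -f $file && -s _;

    open my $fh, '<', $file or return 0;
    binmode $fh;    # avoid newline translation on Windows
    my $magic = '';
    my $got = read $fh, $magic, 5;
    close $fh;

    return ( defined $got && $got == 5 && $magic eq '%PDF-' ) ? 1 : 0;
}

# Report which of the files named on the command line would be converted.
for my $file (@ARGV) {
    printf "%s: %s\n", $file, pdf_looks_usable($file) ? 'convert' : 'skip';
}
```

Files that fail the test can then be skipped, so one bad file does not end
the whole indexing run.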
Brad_Horstkotte@capgroup.com wrote:
> (reposting to add the subject line that I forgot the first time...)
>
> I've been poking around trying to figure out how to get PDF indexing to
> work, and haven't had any luck - I'm running into the same problem which
> was discussed on this thread (null characters in the PDF files being
> replaced with line feed characters, and later on the PDF is seen as
> invalid):
>
> http://swish-e.org/archive/4511.html
>
> Has this problem been fixed?
>
> The PDFs convert fine when running _pdf2html.pl from the command line on
> the file, but fail when converted via the spider.
>
> I am running on Windows 2000; here is my configuration:
>
> ----------
>
> IndexDir spider.pl
> SwishProgParameters default http://L0053022/index.htm
> IndexOnly .htm .html .pdf
>
> StoreDescription HTML* <body> 10000
> FuzzyIndexingMode Stemming_en2
> MetaNames description keywords
> PropertyNames description keywords
>
> IndexReport 3
> ParserWarnLevel 1
>
> FilterDir /SWISH-E/lib/swish-e
> FileFilter .pdf _pdf2html.pl '"%p" -'
>
> ----------
>
> ...and here are the errors I get when doing the PDF conversion via the
> spider:
>
> ----------
>
> http://l0053022/pdf/shareholder/editable/afd-103_aflink.pdf - Using HTML2
> parser - Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> C:\SWISH-E\lib\swish-e\_pdf2html.pl: Failed close on pipe to pdfinfo for
> C:\TEMP \swtmpfltrcnaaaa: 256 at C:\SWISH-E\lib\swish-e\_pdf2html.pl line
> 54.
> (no words indexed)
>
> ----------
>
> I saw SWISH::Filter mentioned as an alternative, but so far have avoided it
> since I'm a perl dolt, and it looked like less of a turnkey alternative.
>
> Thanks in advance - Brad
--
Suzanne Smolowitz, Raytheon Systems Company
suzanne@sels.rsc.raytheon.com
--------------BA9D3AAE491A2070C0D4EF52--
--------------DE3D038C31D82805D97AA2CB
Content-Type: text/plain; charset=us-ascii;
name="pdf2html.pm"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="pdf2html.pm"
package pdf2html;
use strict;
=pod
=head1 NAME
pdf2html - swish-e sample module to convert pdf to html
=head1 SYNOPSIS
use pdf2html;
my $html_record_ref = pdf2html( $pdf_file_name, 'title' );
# or by passing content in a scalar reference
my $html_text_ref = pdf2html( \$pdf_content, 'title' );
=head1 DESCRIPTION
Sample module for use with other swish-e 'prog' document source programs.
Pass either a file name, or a scalar reference.
The difference is that when you pass a reference to a scalar
only the content is returned. When you pass a file name
an entire record is returned ready to be fed to swish -- this
includes the headers required by swish for indexing.
The second optional parameter is the extracted PDF info tag to use as the HTML title.
The plan is to find a library that will do this to avoid forking an external
program.
=head1 REQUIREMENTS
Uses the xpdf package that includes the pdftotext conversion program.
This is available from http://www.foolabs.com/xpdf/xpdf.html.
You will also need the module File::Temp (and its dependencies)
available from CPAN if passing content to this module (instead of a file name).
=head1 AUTHOR
Bill Moseley
=cut
use Symbol;
use vars qw(
@ISA
@EXPORT
$VERSION
);
# $Id: pdf2html.pm,v 1.5 2002/09/11 00:53:44 whmoseley Exp $
$VERSION = sprintf '%d.%02d', q$Revision: 1.5 $ =~ /: (\d+)\.(\d+)/;
require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(pdf2html);
if ( $0 eq 'pdf2html.pm' ) {
my $file = shift || die "Usage: perl pdf2html.pm file.pdf [title tag]\n";
my $title = shift;
print ${pdf2html( $file, $title )};
}
sub pdf2html {
my $file_or_content = shift;
my $title_tag = shift;
my $file = ref $file_or_content
? create_temp_file( $file_or_content )
: $file_or_content;
my $metadata = get_pdf_headers( $file );
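#ADD SOME ERROR CHECKING FOR BAD PDF FILES
# get_pdf_headers() always returns a hash reference, so also test for an
# empty hash (pdfinfo printed nothing usable, e.g. a zero-length file).
if ( !$metadata || !%$metadata ) {
unlink $file if ref $file_or_content;
my $empty = "";
return \$empty;
}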
my $headers = format_metadata( $metadata );
if ( $title_tag && exists $metadata->{ $title_tag } ) {
my $title = escapeXML( $metadata->{ $title_tag } );
$headers = "<title>$title</title>\n" . $headers
}
# Check for encrypted content
my $content_ref;
# patch provided by Martial Chartoire
if ( $metadata->{encrypted} && $metadata->{encrypted} =~ /yes\.*\scopy:no\s\.*/i ) {
$content_ref = \'';
} else {
$content_ref = get_pdf_content_ref( $file );
}
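#ADD SOME ERROR CHECKING FOR BAD PDF FILES
# Always return a scalar reference, since callers dereference the result.
if ( !$content_ref || !defined $$content_ref ) {
unlink $file if ref $file_or_content;
my $empty = "";
return \$empty;
}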
my $txt = <<EOF;
<html>
<head>
$headers
</head>
<body>
<pre>
$$content_ref
</pre>
</body>
</html>
EOF
if ( ref $file_or_content ) {
unlink $file;
return \$txt;
}
my $mtime = (stat $file )[9];
my $size = length $txt;
# The blank line before EOF ends the header block, as the swish-e
# prog input format expects.
my $ret = <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $file

EOF
$ret .= $txt;
return \$ret;
}
sub get_pdf_headers {
my $file = shift;
my $sym = gensym;
# Use low-precedence "or": with "||" the warn was bound to the command string and never ran.
open $sym, "/local/transwm/search_engine/prog_bin/pdfinfo $file |" or warn "$0: Failed to open $file $!";
my %metadata;
while (<$sym>) {
if ( /^\s*([^:]+):\s+(.+)$/ ) {
my ( $metaname, $value ) = ( lc( $1 ), $2 );
$metaname =~ tr/ /_/;
$metadata{$metaname} = $value;
}
}
close $sym or warn "$0: Failed close on pipe to pdfinfo for $file: $?";
return \%metadata;
}
sub format_metadata {
my $metadata = shift;
my $metas = join "\n", map {
qq[<meta name="$_" content="] . escapeXML( $metadata->{$_} ) . '">';
} sort keys %$metadata;
return $metas;
}
sub get_pdf_content_ref {
my $file = shift;
my $sym = gensym;
open $sym, "/local/transwm/search_engine/prog_bin/pdftotext $file - |" or warn "$0: failed to run pdftotext: $!";
local $/ = undef;
my $content = escapeXML(<$sym>);
close $sym or warn "$0: Failed close on pipe to pdftotext for $file: $?";
return \$content;
}
# How are URLs printed with pdftotext?
sub escapeXML {
my $str = shift;
for ( $str ) {
s/</&lt;/go;
s/>/&gt;/go;
tr/\014/ /; # ^L
# s/&/&amp;/go;
# s/"/&quot;/go;
}
return $str;
}
# This is the portable way to do this, I suppose.
# Otherwise, just create a file in the local directory.
sub create_temp_file {
my $scalar_ref = shift;
require "File/Temp.pm";
my ( $fh, $file_name ) = File::Temp::tempfile();
print $fh $$scalar_ref or warn $!;
close $fh or warn "Failed to close '$file_name' $!";
return $file_name;
}
1;
--------------DE3D038C31D82805D97AA2CB--
Received on Wed Dec 17 13:05:45 2003