Re: PDF to HTML causing swish-e to crash

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Thu Oct 10 2002 - 23:55:04 GMT
> Question: is anyone successfully indexing PDF documents on Linux with
> swish-e-2.2.1 ?  If so, can you please post your swish-e configuration
> indicating how you are filtering PDF to HTML (or text)?

PDF indexing works fine here, assuming the PDF is readable by xpdf.
However, I have run across quite a few PDFs that xpdf cannot read.
I believe that Acrobat Reader supports dumping PDF to text.  I've
thought about trying to use it in place of pdftotext.
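
One way to find the PDFs that xpdf cannot read before indexing is to run the converter over each file and look at its exit status. The following is only a minimal sketch: it assumes the installed pdftotext exits non-zero when it cannot process a file (later xpdf releases document this; I have not checked 0.92), and the name converter_accepts and its optional command-prefix argument are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the converter exits with status 0 for $file, else 0.
# $converter is an optional array ref holding the command prefix
# (hypothetical parameter); it defaults to plain "pdftotext".
# Assumption: the converter reports unreadable input via its exit status.
sub converter_accepts {
    my ( $file, $converter ) = @_;
    $converter ||= ['pdftotext'];

    # List-form system() bypasses the shell, so odd file names are safe.
    # The converted text is discarded by writing it to /dev/null.
    my $status = system( @$converter, $file, '/dev/null' );
    return $status == 0 ? 1 : 0;
}

# When file names are given on the command line, list the ones
# the converter rejects.
if (@ARGV) {
    for my $file (@ARGV) {
        print "unreadable: $file\n" unless converter_accepts($file);
    }
}
```

Run over a directory of PDFs, this prints the files that would need some other converter.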


$ uname -a
Linux Daneel.webaugur.com 2.4.9-31 #1 Tue Feb 26 06:23:51 EST 2002 i686 unknown

$ cat /etc/redhat-release 
Red Hat Linux release 7.1 (Seawolf)

$ swish-e -V
SWISH-E 2.3-dev-02

$ xpdf -v
xpdf version 0.92
Copyright 1996-2000 Derek B. Noonburg

$ pdfinfo -v
pdfinfo version 0.92
Copyright 1996-2000 Derek B. Noonburg

$ pdftotext -v
pdftotext version 0.92
Copyright 1996-2000 Derek B. Noonburg



$ cat _swish.conf 

# Misc Cruft
IndexName "Augury Library"
IndexDescription "Augury Library"
IndexPointer "http://library.webaugur.com/"
IndexAdmin "Augur <webmaster@library.webaugur.com>"

# Because I'm the admin, silly.
obeyRobotsNoIndex NO

# Default to HTML2 and specify XML2 and TXT2 explicitly
DefaultContents HTML2
IndexOnly .mpg .mpeg .mov .avi .au .wav .html .htm .php .mp3 .zip .gz .tgz .tar .doc .pdf .xml .txt .exe .xls .dll
NoContents .mpg .mpeg .mov .avi .au .wav .gz .tgz .tar .exe .dll
IndexContents XML2 .mp3 .zip .xml
IndexContents TXT2 .doc .gz .txt 

# Document Descriptions
StoreDescription XML2 <desc> 64
StoreDescription HTML2 <body> 64
StoreDescription TXT2 48

# Let's translate our various files into something that makes sense
FileFilter .pdf /opt/swish-e/filter-bin/_pdf2html.pl "'%p'"
FileFilter .mp3 /opt/swish-e/filter-bin/_mp32xml2.sh "'%p'"
FileFilter .doc /usr/local/bin/catdoc " -s8859-1 -d8859-1 '%p'"
FileFilter .zip /opt/swish-e/filter-bin/_zipfiles.sh "'%p'"
FileFilter .php /usr/local/bin/php " -q '%p'"
FileFilter .xls /opt/swish-e/filter-bin/msexcel.php "'%p'"

# We want to be left with only the /Library/* portion of the filename.
ReplaceRules remove /home/public/Library


$ swish-e -c _swish.conf -i Electronics/semiconductors/Common_Parts/ -v3
Parsing config file '_swish.conf'
Indexing Data Source: "File-System"
Indexing "Electronics/semiconductors/Common_Parts/"

Checking dir "Electronics/semiconductors/Common_Parts"...
  nte923.pdf - Using HTML2 parser -  (658 words)
  nte5470_76.pdf - Using HTML2 parser -  (468 words)
  nte5061a.pdf - Using HTML2 parser -  (1224 words)
  nte1690.pdf - Using HTML2 parser -  (1329 words)
  nte130.pdf - Using HTML2 parser -  (492 words)
  nte180.pdf - Using HTML2 parser -  (433 words)
  nte245.pdf - Using HTML2 parser -  (352 words)
  nte283.pdf - Using HTML2 parser -  (394 words)
  nte5562_66.pdf - Using HTML2 parser -  (284 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1070 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1070 unique words indexed.
5 properties sorted.                                              
9 files indexed.  236629 total bytes.  5634 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!


$ cat /opt/swish-e/filter-bin/_pdf2html.pl 
#! /usr/local/bin/perl -w
use strict;

# -- Filter PDF to simple HTML for swish
# --
# -- 2000-05  rasc
#
=pod

This filter requires two programs "pdfinfo" and "pdftotext".
These programs are part of the xpdf package found at
http://www.foolabs.com/xpdf/xpdf.html.

These programs must be found in the PATH when indexing is run, or 
explicitly set the path in this program:

  $ENV{PATH} = '/path/to/programs'

"pdfinfo" extracts the document info from a pdf file, if any exist,
and creates metanames for swish to index.  See man pdfinfo(1) for
information what keywords are available.

An HTML title is created from the "title" and "subject" pdf info data.
Adjust as needed below.

How the extracted keyword info is indexed in Swish-e is controlled by
the following Swish-e configuration settings: MetaNames, PropertyNames,
UndefinedMetaTags.

Passing the -raw option to pdftotext may improve indexing time by
reducing the size of the converted output.

=cut

my $file = shift || die "Usage: $0 <filename>\n";

#
# -- read pdf meta information
#

my %metadata;

# "or" rather than "||": "||" binds to the command string, so die could never fire
open F, "pdfinfo $file |" or die "$0: failed to run pdfinfo: $!";

while (<F>) {
    if ( /^\s*([^:]+):\s+(.+)$/ ) {
        my ( $metaname, $value ) = ( lc( $1 ), escapeHTML( $2 ) );
        $metaname =~ tr/ /_/;
        $metadata{$metaname} = $value;
    }
}
close F or die "$0: Failed close on pipe to pdfinfo for $file: $?";


# Set the default title from the title and subject info

my @title = grep { $_ } @metadata{ qw/title subject/ };
delete $metadata{$_} for qw/title subject/;


my $title = join ' // ', ( @title ? @title : 'Unknown title' );

my $metadata = 
    join "\n",
        map { qq[<meta name="$_" content="$metadata{$_}">] }
                   sort keys %metadata;


print <<EOF;
<html>
<head>
    <title>
        $title
    </title>
    $metadata
</head>
<body>
EOF

# Might be faster to use sysread and read in larger blocks

open F, "pdftotext $file - |" or die "$0: failed to run pdftotext: $!";
print escapeHTML($_) while ( <F> );
close F or die "$0: Failed close on pipe to pdftotext for $file: $?";

print "</body></html>\n";


# How are URLs printed with pdftotext?
sub escapeHTML {

   my $str = shift;

   for ( $str ) {
       s/&/&amp;/go;
       s/</&lt;/go;
       s/>/&gt;/go;
       s/"/&quot;/go;
       tr/\014/ /; # ^L
    }
   return $str;
}
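
The comment in the filter about sysread could look roughly like the sketch below. It is not part of the distributed filter: copy_escaped, escape_html and the 64 KB default block size are names and values chosen here for illustration, and the list-form pipe open assumes Perl 5.8 or later. Every substitution acts on a single byte, so escaping block by block gives the same result as escaping line by line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that matter to an HTML parser and turn
# form feeds (^L) into spaces, as the filter's escapeHTML does.
sub escape_html {
    my $str = shift;
    $str =~ s/&/&amp;/g;
    $str =~ s/</&lt;/g;
    $str =~ s/>/&gt;/g;
    $str =~ s/"/&quot;/g;
    $str =~ tr/\014/ /;
    return $str;
}

# Copy everything from $in to $out in blocks of $block_size bytes,
# escaping each block.  Block boundaries cannot split a match because
# each pattern is exactly one byte long.
sub copy_escaped {
    my ( $in, $out, $block_size ) = @_;
    $block_size ||= 64 * 1024;    # illustrative default, not tuned

    my $buf;
    while (1) {
        my $n = sysread( $in, $buf, $block_size );
        die "read failed: $!" unless defined $n;
        last if $n == 0;          # end of input
        print {$out} escape_html($buf);
    }
    return;
}

# With a file name on the command line, convert it the way the filter does.
if (@ARGV) {
    my $file = shift @ARGV;

    # List-form open runs pdftotext without a shell, so spaces or
    # quotes in $file need no extra escaping.
    open my $pipe, '-|', 'pdftotext', $file, '-'
        or die "$0: failed to run pdftotext: $!";
    copy_escaped( $pipe, \*STDOUT );
    close $pipe or die "$0: pdftotext failed for $file: $?";
}
```

Whether this is actually faster than the line-by-line loop depends on the documents; I have not measured it.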

-- 
 David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Augury Net - http://home.webaugur.com/
  ICQ - 412039
Received on Thu Oct 10 23:58:41 2002