pdf2xml problem while indexing pdf files

From: <wayne.schomaker(at)not-real.state.co.us>
Date: Wed Nov 12 2003 - 20:13:01 GMT
I have been attempting to use the pdf2xml program without any success. I
am running SWISH-E and copied a program (the first listing below) from
www.linuxjournal.com/articles.php?sid=6652, which uses pdf2xml to index
PDF files. SWISH-E itself works properly from both a browser and the
command line when indexing regular text files.

(howto-pdf-prog.pl file)
#!/usr/bin/perl -w
use pdf2xml;
my @files =
system ('find /var/www/html/ccsp/docs/ -name *.pdf -print');
# system ('find /var/www/html/ccsp/docs/ -name *.pdf > /var/www/html/ccsp/docs/results.file');
for (@files) {
chomp();
my $xml_record_ref = pdf2xml($_);
print $$xml_record_ref;
}
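
A note on the listing above: Perl's system() runs a command and returns
its wait status, not the text the command prints, so @files ends up
holding a single number rather than a list of paths. The following is
only a sketch of one alternative, assuming the core File::Find module is
acceptable; the helper name find_pdfs is hypothetical and does not come
from the Linux Journal article.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the paths of all *.pdf files under $dir.
sub find_pdfs {
    my ($dir) = @_;
    my @files;
    # Inside the callback, $_ is the file's base name and
    # $File::Find::name is its full path.
    find(sub { push @files, $File::Find::name if -f && /\.pdf$/i }, $dir);
    my @sorted = sort @files;
    return @sorted;
}

# Print one path per line; each path could then be handed to pdf2xml()
# exactly as in the original loop.
print "$_\n" for find_pdfs($ARGV[0] // '.');
```

Called with /var/www/html/ccsp/docs/ as its argument, this would list
the same files the find command was meant to return.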

The following is the configuration file I am using:

(howto-pdf-conf file)
IndexDir	./howto-pdf-prog.pl
# prog file to hand us XML docs
IndexFile	./howto-pdf.index
# Index to create
UseStemming	yes
MetaNames	swishtitle swishdocpath+	

When executed, the following is the result:

[root@DPA2 ccsp]# swish-e -c howto-pdf.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./howto-pdf-prog.pl"
Error: Couldn't open file '65280'
./howto-pdf-prog.pl: Failed close on pipe to pdfinfo for 65280: 256 at
pdf2xml.pm line 129.
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!

My tech support and I cannot figure out what file '65280' is. There is
no such filename anywhere on the server, and it is not one of the PDF
files in our test directory (../ccsp/docs/). We are at a loss as to what
to do next.
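
One possible reading, offered as an assumption rather than a confirmed
diagnosis: 65280 looks like a wait status rather than a filename. Perl's
system() returns such a status, with the command's exit code in the high
byte, and the 256 in the pdfinfo message may be the same kind of value.
A minimal sketch of the decoding (decode_status is a hypothetical helper
name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a wait status (as returned by system() or left in $?) into parts.
sub decode_status {
    my ($status) = @_;
    return (
        exit_code => $status >> 8,    # high byte: the program's exit code
        signal    => $status & 127,   # low 7 bits: terminating signal, if any
    );
}

for my $status (65280, 256) {
    my %s = decode_status($status);
    print "$status -> exit code $s{exit_code}, signal $s{signal}\n";
}
```

Under that reading, 65280 corresponds to an exit code of 255 and 256 to
an exit code of 1.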

Has anyone else experienced a similar problem and can help? Thank you for
any assistance you can provide.

Wayne Schomaker
303-239-4394
Received on Wed Nov 12 20:13:29 2003