[swish-e] Current favorite PDF filter on FreeBSD?

From: David Brown <dave(at)not-real.davidhbrown.us>
Date: Thu Sep 02 2010 - 01:26:07 GMT
Anyone have a favorite PDF filter other than pdftohtml on FreeBSD (or *nix)
they're using these days? A few minutes of googling didn't turn up anything
more up-to-date than xpdf and pdftohtml. I couldn't find anything like an
HTML driver for GhostScript
(http://www.ghostscript.com/doc/current/Devices.htm#Display_devices).

I've been using pdftohtml 0.39 (http://sourceforge.net/projects/pdftohtml/)
for a number of years as a swish-e filter, but after upgrading to Adobe CS5,
I find that it can no longer process these files unless I monkey around with
exporting through Acrobat's preflight tool, which is not practical for an
automated workflow. (Running Acrobat 8 in a VM would be a possibility, but
is inconvenient.)

I was able to install a recent xpdf (http://www.foolabs.com/xpdf/) through
the FreeBSD ports collection and get what I needed for the moment by using
the -htmlmeta option of its pdftotext utility. I see that there is a
pdftohtml 0.40a that uses a more recent xpdf (3.02) and so might work, but
(a) I'm not too keen on using alpha software in production and (b) the
FreeBSD port is still on an older version (pdftohtml-0.39_5) and I'm having
trouble getting 0.40a to compile from source on this system.
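
For reference, the pdftotext workaround described above can be wrapped in a
small filter script. This is only a minimal sketch: it assumes pdftotext is on
the PATH and that passing "-" as the output file sends the result to stdout.
The function names are hypothetical, and how the script is hooked into swish-e
is left out.

```python
import subprocess
import sys


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Return the argument list for converting one PDF to simple HTML.

    -htmlmeta asks pdftotext for a minimal HTML page that includes the
    PDF's metadata; the trailing "-" is assumed to mean "write to stdout".
    """
    return [executable, "-htmlmeta", pdf_path, "-"]


def pdf_to_html(pdf_path):
    """Run pdftotext and return its output as text.

    Raises subprocess.CalledProcessError if pdftotext exits non-zero.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Only convert when exactly one path is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(pdf_to_html(sys.argv[1]))
```

Keeping the command construction in its own function makes it easy to swap in
a different converter later without touching the calling code.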

Side question: I thought I read somewhere that swish-e's HTML parser will
weight text in headings more heavily than in regular text, but I'm not
finding that in the documentation. Is this in fact the case? If not, then I
might as well just stick with pdftotext. If so, then I'll try harder
to get pdftohtml 0.40a compiled or look into any alternatives you all might
suggest.

I suppose pdf -> ghostscript -> pdf -> pdftohtml might work, if slowly... at
least I've got things arranged so I never have to index all 7000+ pdfs at
once!
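
The pdf -> ghostscript -> pdf -> pdftohtml idea could be sketched roughly as
follows. This is untested against CS5 output, so whether rewriting the file
actually makes it readable by pdftohtml 0.39 is an open assumption. The gs
flags are the usual ones for the pdfwrite device; the pdftohtml flags
(-stdout, -noframes, -i) are assumed to be available in the installed version,
and all function names are hypothetical.

```python
import os
import subprocess
import tempfile


def build_gs_command(src_pdf, dst_pdf, executable="gs"):
    """Argument list that has Ghostscript rewrite src_pdf as dst_pdf."""
    return [
        executable,
        "-q",                      # suppress startup messages
        "-dNOPAUSE",               # do not pause between pages
        "-dBATCH",                 # exit when the input is processed
        "-sDEVICE=pdfwrite",       # emit PDF rather than rendering pages
        "-sOutputFile=" + dst_pdf,
        src_pdf,
    ]


def build_pdftohtml_command(pdf_path, executable="pdftohtml"):
    """Argument list for a single-page HTML dump on stdout, images ignored."""
    return [executable, "-stdout", "-noframes", "-i", pdf_path]


def pdf_to_html_via_ghostscript(src_pdf):
    """Rewrite the PDF with Ghostscript, then convert the copy to HTML."""
    with tempfile.TemporaryDirectory() as workdir:
        rewritten = os.path.join(workdir, "rewritten.pdf")
        subprocess.run(build_gs_command(src_pdf, rewritten), check=True)
        result = subprocess.run(
            build_pdftohtml_command(rewritten),
            check=True,
            capture_output=True,
        )
    return result.stdout.decode("utf-8", errors="replace")
```

The temporary directory is removed as soon as the conversion finishes, so the
intermediate PDF never accumulates on disk; the extra Ghostscript pass is what
makes this route slow.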

Thanks,
Dave
--
David Brown
dave@davidhbrown.us


Received on Wed Sep 1 21:26:16 2010