Re: [swish-e] swish-e - Help with indexing pdf´s

From: David Brown <dave(at)not-real.davidhbrown.us>
Date: Mon Jun 21 2010 - 23:19:25 GMT
Just in case it might be helpful to you, here are some filter settings I've been using successfully for a number of years:
 
#use FileFilters to process other than HTML
FileFilter .pdf "/usr/local/bin/pdftohtml" "-q -i -stdout -noframes %P"
FileFilter .doc "/usr/local/bin/abiword" "-t html -o fd://1 %P"
 
This is on FreeBSD; pdftohtml is from http://pdftohtml.sourceforge.net/ and is derived from xpdf. (I'm not sure why mine reports version 0.39 while the web page says 0.36 is the latest.)
 
(BTW, getting abiword installed with no GUI was quite tedious and required lots of graphics-oriented dependencies that are now just wasting space on the system.)
 
BTW, lots of nulls sounds like maybe you're getting the still-compressed data stream from the PDFs and the HTML conversion isn't happening. Try running your filter from the command line and see if you get HTML or gibberish/error. Also, make sure that you replace path-to-swish-e with your actual path to swish-e; on my system, this is /usr/local/bin/swish-e
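
If you'd rather script that command-line check than eyeball the output, here is a minimal sketch in Python. It assumes the pdftohtml path and options from my settings above and a test file named Document.pdf in the current directory; the function names are made up for illustration, and the "looks like HTML" test is only a rough heuristic.

```python
import subprocess
from typing import Sequence


def classify_filter_output(data: bytes) -> str:
    """Roughly classify what a filter program wrote to stdout."""
    if not data.strip():
        return "empty"
    if b"\x00" in data:
        # Null bytes suggest raw, unconverted binary data.
        return "binary"
    head = data[:2048].lower()
    if b"<html" in head or b"<!doctype" in head:
        return "html"
    return "text"


def run_filter(command: Sequence[str]) -> str:
    """Run the filter command and classify its stdout."""
    try:
        result = subprocess.run(list(command), capture_output=True, check=False)
    except OSError:
        # The program is missing or not executable.
        return "error"
    if result.returncode != 0:
        return "error"
    return classify_filter_output(result.stdout)


if __name__ == "__main__":
    # Assumed path, options, and file name; adjust them to your system.
    cmd = ["/usr/local/bin/pdftohtml", "-q", "-i", "-stdout", "-noframes", "Document.pdf"]
    print(run_filter(cmd))
```

A result of "binary" or "error" would suggest the conversion is not happening, which would fit the null-character warnings.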
--
Dave Brown
dave@davidhbrown.us
 
From: users-bounces@lists.swish-e.org [mailto:users-bounces@lists.swish-e.org] On Behalf Of pgeo@gmx.de
Sent: Monday, June 21, 2010 9:01 AM
To: Swish-e Users Discussion List
Subject: [swish-e] swish-e - Help with indexing pdf´s
 

Hi @ All,

I have a short question again:

First:
When I want to index PDF files, must the program xpdf be installed on the server where I run the indexing, or on the server where I run the search (that is, the server where swish.cgi is called)?

Second:
When I start the indexing I get errors like this:

 - Using HTML parser -  (98779 words)
  Document.pdf
Warning: Substituted 2397 embedded null character(s) in file '/Document1.pdf

and so on ... and I don't know why.
In my swish.conf I wrote:

...
IndexOnly .htm .html .php .doc .xml .pdf
FileFilter /path-to-swishe/filter-bin\_pdf2html.pl "%p -" /\.pdf$/
...

and there are no PDFs in my search results.
Do I have to add anything else to the conf file?
Does anybody have an idea?
Regards
Peter



Received on Mon Jun 21 19:19:43 2010