Re: [swish-e] HTML Parser chokes when indexing image pdf's

From: Peter Karman
Date: Wed, 14 Mar 2012 22:17:09 -0500
Dr Michael Daly wrote on 3/14/12 9:14 PM:
> The funny thing is that *no* Filefilter options are specified in my
> swish1.conf:
>  IndexOnly .htm .html .txt .doc .pdf .xls
>  IndexContents TXT* .txt
>  DefaultContents HTML*
> I can see both /opt/bin/catdoc and /opt/bin/pdftotext, with /opt/bin
> being in $PATH so I presume there must be some hard coding within swish-e
> that picks them up without the configuration of eg FileFilter
> Should these directives be added?:
> FileFilter  .pdf    pdf2html
> FileFilter .pdf     pdftotext   "'%p' -"
> FileFilter .doc     /opt/bin/catdoc "-s8859-1 -d8859-1 %p"
> If not, can the parsing errors be ignored?

swish-e is trying to parse your .pdf as HTML, because you've not specified a
filter. You must specify a filter for anything that is not txt, html or xml.
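
As a rough sketch of what that could look like in your swish1.conf, assuming pdftotext and catdoc behave on your system as in your quoted lines (the /opt/bin path for catdoc is taken from your message, and I have not tested this against your setup):

```
# Convert .pdf to plain text on stdout before indexing
FileFilter .pdf  pdftotext  "'%p' -"

# Convert .doc to plain text with catdoc
FileFilter .doc  /opt/bin/catdoc  "-s8859-1 -d8859-1 '%p'"

# Filter output is plain text, so parse it as TXT
# rather than falling through to DefaultContents HTML*
IndexContents TXT* .txt .pdf .doc
```

Note that .xls is also in your IndexOnly list and would need a filter of its own, which is not shown here. An image-only PDF has no text layer, so pdftotext will likely return little or nothing for those files.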

Peter Karman  .  .  peter(at)
Received on Thu Mar 15 2012 - 03:17:12 GMT