Re: [swish-e] HTML Parser chokes when indexing image pdf's

From: Dr Michael Daly
Date: Fri, 16 Mar 2012 00:05:18 +1100 (EST)
Thanks very much. This part is more or less solved, though I am still
getting these errors for .doc and .pdf files (I presume because they were
not originally .html files):
 error: htmlParseEntityRef: no name

I have added this filtering:
    FileFilter .pdf pdftotext "'%p' -"
    FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
    FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"
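
A possible explanation, offered as an assumption rather than a confirmed
diagnosis: the filters above emit plain text, but under DefaultContents HTML*
that output may still be handed to the HTML parser, where a bare "&" in the
text produces "htmlParseEntityRef: no name". A minimal sketch of a config
that routes the filtered types to the text parser instead (untested here;
whether TXT* suits the xls2csv output in particular is a guess):

    # Sketch only: assumes the parser for filtered output is chosen by
    # file extension via IndexContents.
    IndexOnly .htm .html .txt .doc .pdf .xls
    IndexContents TXT* .txt .pdf .doc .xls
    DefaultContents HTML*
    FileFilter .pdf pdftotext "'%p' -"
    FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
    FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"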


Dr Michael Daly wrote on 3/14/12 9:14 PM:
> The funny thing is that *no* FileFilter options are specified in my
> swish1.conf:
>  IndexOnly .htm .html .txt .doc .pdf .xls
>  IndexContents TXT* .txt
>  DefaultContents HTML*
> I can see both /opt/bin/catdoc and /opt/bin/pdftotext, with /opt/bin
> being in $PATH, so I presume there must be some hard-coding within
> swish-e that picks them up without FileFilter being configured.
> Should these directives be added?
> FileFilter  .pdf    pdf2html
> FileFilter .pdf     pdftotext   "'%p' -"
> FileFilter .doc     /opt/bin/catdoc "-s8859-1 -d8859-1 %p"
> If not, can the parsing errors be ignored?

swish-e is trying to parse your .pdf as HTML, because you've not specified a
filter. You must specify a filter for anything that is not txt, html or

Peter Karman  .  .  peter(at)
