[swish-e] Swish-e not indexing doc or PDF files

From: Liam Buchanan <Liam.Buchanan(at)>
Date: Tue Feb 12 2008 - 00:26:57 GMT
Hope someone can suggest a solution to this frustrating problem.
We are running swish-e on our development server, which indexes our
production intranet server. The problem is that the indexer cannot
process .doc or PDF files. When it reaches a hyperlink that points to
a PDF or .doc file, the process halts and prints the output shown at
the end of this message.
Before running swish-e, we connect to our production server through a
proxy (ntlmaps).
Indexing runs fine for ordinary text content.
Here are the specs of our environment:

Swish-e Version:
Swish-e version 2.4.5 
OS (development and production):
Windows 2000 
Proxy port application for remote index on another server:

.conf file contents (the commented-out filters are others I have
tried; we have also tried sending the filter output to a text file,
but the output is garbled):

#FileFilter .doc       /usr/bin/antiword "'%p'"
#FileFilter .PDF       /usr/bin/pdftotext   "'%p' -"
#FileFilter .PDF       c:\SWISH-E\bin\pdftotext '"%p" -htmlmeta
#FileFilter .doc       c:\SWISH-E\bin\catdoc.exe '-s8859-1 -d8859-1 "%p" > temp.txt'

FileFilter .doc       c:\SWISH-E\bin\catdoc.exe "-s8859-1 -d8859-1 %p"
FileFilter .PDF       c:\SWISH-E\bin\pdftotext.exe "%p"

IndexOnly  .txt .ps .PDF .html .htm .doc .rtf .xls .mcd .for .ini
IndexOnly  .eps .pcm .c .h .cc .m .sh  .ppt
IndexOnly  .for .cpp
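
In case it helps anyone reproduce this, here is a minimal sketch for checking a filter command by hand, outside swish-e. It assumes (as far as I understand the FileFilter documentation) that swish-e reads the converted text from the filter's standard output. The function names are made up for illustration, and the two stand-in "filters" just call the Python interpreter so the sketch runs without pdftotext or catdoc installed.

```python
import subprocess
import sys


def filter_stdout(argv, timeout=30):
    """Run a filter command and return the bytes it wrote to standard output.

    Raises subprocess.TimeoutExpired if the command runs longer than `timeout`
    seconds, so a filter that hangs does not freeze the caller.
    """
    result = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,   # the filter gets no interactive input
        stdout=subprocess.PIPE,     # capture what an indexer would read
        stderr=subprocess.PIPE,     # keep error messages out of the result
        timeout=timeout,
    )
    return result.stdout


def check_filter(argv, timeout=30):
    """Return True if the filter wrote any non-whitespace text to stdout."""
    try:
        return bool(filter_stdout(argv, timeout).strip())
    except subprocess.TimeoutExpired:
        # A filter that never finishes is treated as producing nothing.
        return False


# Stand-in filters (assumptions for the demo, not real converters).
writes_to_stdout = [sys.executable, "-c", "print('extracted text')"]
writes_nothing = [sys.executable, "-c", "pass"]
```

With a real converter the call would look something like `check_filter([r"c:\SWISH-E\bin\pdftotext.exe", "sample.pdf", "-"])`, where `sample.pdf` is a placeholder for a local test file. The trailing `-` matters for pdftotext: without it the text goes to a `.txt` file next to the input rather than to standard output, so a check like this would report nothing.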


Output:

 (650 words)
http://*******/dsdweb/v4/apps/web/content.cfm?id=3150 - Using HTML2 parser -
http://******/dsdweb/v4/apps/web/content.cfm?id=3150:784: error: Unexpected end tag : a
ML = "[<a href=\"#\" onclick=\"cntCtrlsState(\'hide\'); return

 (523 words)
http://******/dsdweb/v4/apps/web/secure/docs/103.doc - Using TXT2 parser -  (no words indexed)

At this point the whole process freezes.

Any ideas???


