Re: [swish-e] Swish-e not indexing doc or PDF files

From: Dr Michael Daly <gp(at)not-real.holisticgp.com.au>
Date: Tue Feb 12 2008 - 12:20:42 GMT
Hi Liam
I am using Linux, so I am not sure if this is relevant, but my .conf entries
are slightly different, e.g. there is an apostrophe on either side of the %p
in each of the following:

FileFilter .pdf /usr/local/bin/pdftotext   "'%p' -"
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
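
The effect of the quotes around %p can be checked outside Swish-e. The sketch below is only an illustration: it assumes the option string is split the way a POSIX shell would split it, and `expand_filter_options` and the sample path are hypothetical, not part of Swish-e.

```python
import shlex

def expand_filter_options(options, path):
    """Substitute %p in a FileFilter option string and split it shell-style.

    Hypothetical helper for illustration only; it is not Swish-e code.
    """
    return shlex.split(options.replace("%p", path))

path = "/srv/docs/annual report.pdf"  # made-up path containing a space

# Quoted %p: the path stays one argument, and "-" asks pdftotext for stdout.
print(expand_filter_options("'%p' -", path))
# Unquoted %p: the space splits the path into two arguments.
print(expand_filter_options("%p -", path))
```

With paths that contain no spaces the two forms behave the same, so the quotes may or may not be what matters in a given setup.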

And also these entries:
# Only the following type of files
    IndexOnly .htm .html .txt .doc .pdf

    # Tell Swish-e that .txt files are to use the text parser.
    IndexContents TXT* .txt

    # Otherwise, use the HTML parser
    DefaultContents HTML*

    # Ask libxml2 to report any parsing errors and warnings or 
    # any UTF-8 to 8859-1 conversion errors
    ParserWarnLevel 9

Rgds
Michael

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Liam Buchanan
Sent: Tuesday, 12 February 2008 11:27 AM
To: users@lists.swish-e.org
Subject: [swish-e] Swish-e not indexing doc or PDF files

Hi,
Hope someone can suggest a solution to this frustrating problem.
We are running swish-e on our development server, which indexes our
production intranet server. The problem is that the indexer cannot
process .doc or PDF files. When it reaches a hyperlink to a PDF or
.doc file, the process halts and produces the output shown below
(under Output).
Before running swish-e, we connect to our production server through a
proxy (ntlmaps).
Indexing runs fine for typical text.
Here are the specs of our environment --

Swish-e Version:
Swish-e version 2.4.5 
------------------------
OS (development and production):
Windows 2000 
------------------------
Proxy port application for remote index on another server:
ntlmaps-0.9.9.0.1
------------------------

.conf file contents: the commented-out filters are other ones I have
tried. We have also attempted to write the results to a text file, but
the output is garbled.

#FileFilter .doc       /usr/bin/antiword "'%p'"
#FileFilter .PDF       /usr/bin/pdftotext   "'%p' -"
#FileFilter .PDF       c:\SWISH-E\bin\pdftotext '"%p" -htmlmeta c:\SWISH-E\pdfoutput.txt'
#FileFilter .doc       c:\SWISH-E\bin\catdoc.exe '-s8859-1 -d8859-1 "%p" > temp.txt'

FileFilter .doc       c:\SWISH-E\bin\catdoc.exe "-s8859-1 -d8859-1 %p"
FileFilter .PDF       c:\SWISH-E\bin\pdftotext.exe "%p"


IndexOnly  .txt .ps .PDF .html .htm .doc .rtf .xls .mcd .for .ini
IndexOnly  .eps .pcm .c .h .cc .m .sh  .ppt
IndexOnly  .for .cpp
------------------------


Output:

 (650 words)
http://*******/dsdweb/v4/apps/web/content.cfm?id=3150 - Using HTML2 parser -
http://******/dsdweb/v4/apps/web/content.cfm?id=3150:784: error: Unexpected end tag : a
ML = "[<a href=\"#\" onclick=\"cntCtrlsState(\'hide\'); return false;\">Hide</a>


 (523 words)
http://******/dsdweb/v4/apps/web/secure/docs/103.doc - Using TXT2 parser -  (no words indexed)
------------------------

At this point the whole process freezes.
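
One way to narrow down a freeze like this is to run the filter command by hand, outside Swish-e, with a time limit. The sketch below is an illustration under assumptions, not a diagnosis: `run_filter` is a hypothetical helper, and the command passed in would be your own pdftotext or catdoc line with a real file path.

```python
import subprocess
import sys

def run_filter(cmd, timeout=30):
    """Run a filter command and return its stdout as text, or None if it hangs."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # the filter did not finish within the time limit
    return result.stdout

# Stand-in command so the sketch runs anywhere; replace it with the real filter,
# e.g. [r"c:\SWISH-E\bin\pdftotext.exe", "some.pdf", "-"].
text = run_filter([sys.executable, "-c", "print('filter output')"])
print("timed out" if text is None else f"{len(text)} characters on stdout")
```

An empty result would be consistent with the "(no words indexed)" line above, since a filter that writes to a file instead of stdout gives the indexer nothing to read, but it does not by itself prove the cause.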

Any ideas???

Thanks 



_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Tue Feb 12 07:20:51 2008