Re: [swish-e] Swish-e not indexing doc or PDF files

From: Liam Buchanan <Liam.Buchanan(at)not-real.dtrdi.qld.gov.au>
Date: Wed Feb 13 2008 - 00:13:04 GMT
Hi,
I have tried all these configurations.
I don't think it is a permissions error because I can open the file fine
in a browser from the remote server.

Thanks
 

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Dr Michael Daly
Sent: Tuesday, 12 February 2008 10:21 PM
To: 'Swish-e Users Discussion List'
Subject: Re: [swish-e] Swish-e not indexing doc or PDF files

Hi Liam,
I am using Linux, so I am not sure whether this is relevant, but my .conf
entries are slightly different; e.g. there is a single quote on either side
of the %p in each of the following:

FileFilter .pdf /usr/local/bin/pdftotext   "'%p' -"
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
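
A brief illustration of why the quotes around %p can matter, assuming a POSIX shell as on a Linux host: without them, a path containing a space is split into two arguments. The sketch uses Python's shlex module to mimic POSIX word splitting. The expand helper and the sample path are hypothetical, and this is not how Swish-e itself builds the command. Note also that cmd.exe on Windows does not treat single quotes as quoting characters, so double quotes around %p would be needed there.

```python
import shlex

def expand(options, path):
    """Substitute %p in a filter option string, then split it as a POSIX shell would."""
    return shlex.split(options.replace("%p", path))

path = "/docs/annual report.pdf"  # hypothetical path containing a space

print(expand("'%p' -", path))  # quoted: the path stays a single argument
print(expand("%p -", path))    # unquoted: the path is split at the space
```

With the quoted form the filter receives the path as one argument followed by "-"; with the unquoted form it receives two path fragments, neither of which names a real file.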

And also these entries:
# Only the following type of files
    IndexOnly .htm .html .txt .doc .pdf

    # Tell Swish-e that .txt files are to use the text parser.
    IndexContents TXT* .txt

    # Otherwise, use the HTML parser
    DefaultContents HTML*

    # Ask libxml2 to report any parsing errors and warnings or 
    # any UTF-8 to 8859-1 conversion errors
    ParserWarnLevel 9

Rgds
Michael

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Liam Buchanan
Sent: Tuesday, 12 February 2008 11:27 AM
To: users@lists.swish-e.org
Subject: [swish-e] Swish-e not indexing doc or PDF files

Hi,
I hope someone can suggest a solution to this frustrating problem.
We are running Swish-e on our development server to index our
production intranet server. The problem is that the indexer cannot
process .doc or PDF files. When indexing reaches a hyperlink to a PDF
or .doc file, the process halts with the output shown below (under
Output). Before running Swish-e, we connect to the production server
through a proxy (ntlmaps). Indexing runs fine for ordinary text pages.
Here are the specs of our environment:

Swish-e Version:
Swish-e version 2.4.5
------------------------
OS (development and production):
Windows 2000
------------------------
Proxy port application for remote index on another server:
ntlmaps-0.9.9.0.1
------------------------

.conf file contents. The commented-out filters are others I have
tried. We have also attempted to redirect the results to a text file,
but the output is garbled.

#FileFilter .doc       /usr/bin/antiword "'%p'"
#FileFilter .PDF       /usr/bin/pdftotext   "'%p' -"
#FileFilter .PDF       c:\SWISH-E\bin\pdftotext '"%p" -htmlmeta c:\SWISH-E\pdfoutput.txt'
#FileFilter .doc       c:\SWISH-E\bin\catdoc.exe '-s8859-1 -d8859-1 "%p" > temp.txt'

FileFilter .doc       c:\SWISH-E\bin\catdoc.exe "-s8859-1 -d8859-1 %p"
FileFilter .PDF       c:\SWISH-E\bin\pdftotext.exe "%p"


IndexOnly  .txt .ps .PDF .html .htm .doc .rtf .xls .mcd .for .ini
IndexOnly  .eps .pcm .c .h .cc .m .sh  .ppt
IndexOnly  .for .cpp
------------------------
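
One way to narrow a problem like this down is to run each filter by hand, outside Swish-e, and check whether it writes any text to standard output. The following is a rough sketch and not part of Swish-e; run_filter and sample.pdf are hypothetical names, and the program path in the comment is the one from the configuration above. It may be relevant that pdftotext, when given only an input file, writes a .txt file next to it rather than printing anything; a trailing "-" sends the text to standard output instead.

```python
import subprocess
import sys

def run_filter(argv, timeout=30):
    """Run a filter command and return (exit code, stdout text, stderr text)."""
    done = subprocess.run(argv, capture_output=True, timeout=timeout)
    # latin-1 maps every byte to a character, so decoding cannot fail
    return done.returncode, done.stdout.decode("latin-1"), done.stderr.decode("latin-1")

# Self-check using the Python interpreter as a stand-in filter that prints to stdout
code, out, err = run_filter([sys.executable, "-c", "print('hello')"])
print("exit code:", code, "characters on stdout:", len(out))

# Against a real filter (hypothetical sample file), something like:
# code, out, err = run_filter([r"c:\SWISH-E\bin\pdftotext.exe", "sample.pdf", "-"])
```

If a filter exits cleanly but reports zero characters on standard output, an indexer reading that output would have nothing to index for the document. The timeout is there so that a filter which hangs shows up as an error rather than freezing the check.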


Output:

 (650 words)
http://*******/dsdweb/v4/apps/web/content.cfm?id=3150 - Using HTML2 parser - http://******/dsdweb/v4/apps/web/content.cfm?id=3150:784: error: Unexpected end tag : a
ML = "[<a href=\"#\" onclick=\"cntCtrlsState(\'hide\'); return false;\">Hide</a>


 (523 words)
http://******/dsdweb/v4/apps/web/secure/docs/103.doc - Using TXT2 parser -  (no words indexed)
------------------------

At this point the whole process freezes.

Any ideas???

Thanks 



_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Tue Feb 12 19:16:16 2008