Indexing non HTML files... (PDF, DOC, ...)

From: Rainer Scherg <Rainer.Scherg(at)not-real.rexroth.de>
Date: Fri May 07 1999 - 17:34:02 GMT
Hi!

In August last year I wrote to this mailing list that
I had made some enhancements which enable swish (1.1) to index
non-HTML files such as PDF or other document types (filter option).

Since then I have occasionally received requests asking how to do this
and where to get the source. Because of these requests I'm adapting the
small enhancements to swish-e 1.3.2.

If there is public interest, I will try to get a small webspace
to provide the source, instead of sending it via email on each request.


---
To describe the changes to swish in short:
New config directives:
     FilterDir   <path-to-filter-progs>
     FileFilter  <file-ext> <filterprog>

e.g.:
     FilterDir   /usr/local/etc/httpd/sbin/filters
     FileFilter  .pdf   pdf-filter.sh
     FileFilter  .doc   ms-wword-filter.sh
     FileFilter  .ps    ps-filter.sh
     FileFilter  .gz    gzip-filter.sh

e.g. the pdf-filter.sh script:
---
#!/bin/sh
# Convert file in arg1 to txt on stdout
/usr/local/bin/pdftotext "$1" - 2>/dev/null
---
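
The other filters listed above could follow the same pattern: take the file
name in arg1 and write plain text to stdout. As a rough sketch only (the
gzip-filter.sh script itself is not shown in this message, and the function
name filter_gzip is invented here for illustration), a gzip filter might look
like this, assuming gzip is available on the PATH:

```shell
#!/bin/sh
# Hypothetical sketch of gzip-filter.sh; not the original script.
# Assumes the same convention as pdf-filter.sh: file in arg1, text on stdout.

# filter_gzip is an illustrative helper name, not part of swish.
filter_gzip() {
    # Decompress to stdout and discard error messages, as pdf-filter.sh does
    gzip -dc "$1" 2>/dev/null
}

# Act as a filter only when a file argument is given
if [ "$#" -gt 0 ]; then
    filter_gzip "$1"
fi
```

Note that this only unpacks the file. If the compressed file is itself a PDF
or PostScript document, the output would still need a further conversion
step before it is useful for indexing.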


Regards Rainer
Received on Fri May 7 10:37:38 1999