
Re: Indexing non HTML files... (PDF, DOC, ...)

From: <john.leth(at)not-real.gulfaero.com>
Date: Sun May 09 1999 - 12:55:47 GMT
One of my few concerns regarding swish-e is that (2 words) get indexed with
PDFs, and my guess is that they are garbage. I'm extremely interested in
obtaining your patch.

I've added a simple "enhancement" (read: kludgy hack) to swishspider
which lets you restrict and/or require URLs in the links that the spider
indexes. It would be really nice if you could simply use a directive
instead. Currently I keep a separate bots directory for each index that
has special restrictions.
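
The restrict/require idea could be sketched as a small shell helper. All
names here are illustrative: filter_urls, REQUIRE_RE, and RESTRICT_RE are
my inventions for this sketch, not part of the actual swishspider code.

```shell
# Illustrative sketch only -- not real swishspider code.
# Read URLs on stdin, one per line; keep only those matching a
# "require" pattern, then drop any matching a "restrict" pattern.
filter_urls() {
    # REQUIRE_RE / RESTRICT_RE are hypothetical knobs; defaults mean
    # "require nothing in particular" and "restrict nothing".
    require="${REQUIRE_RE:-.}"
    restrict="${RESTRICT_RE:-THIS_NEVER_MATCHES}"
    grep -E "$require" | grep -Ev "$restrict"
}
```

For example, setting REQUIRE_RE to your own host and RESTRICT_RE to a
private path would keep the spider on-site while skipping restricted
areas.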

----
John Leth-Nissen
Web Developer
Gulfstream Aerospace Corp.




Rainer Scherg wrote:

> Hi!
>
> In August last year I wrote a message to this email list saying
> that I'd done some enhancements which enable swish (1.1) to index
> non-HTML files like PDF and other document types (a filter option).
>
> Since then I have occasionally gotten requests about how to do this and
> where to get the source. Because of these requests, I'm adapting the
> small enhancements to swish-e 1.3.2.
>
> If there is public interest, I would try to get some small webspace
> to host the source - instead of sending it via email on each request.
>
> ---
> To describe the changes to swish in short:
> new config directives:
>      FilterDir   <path-to-filter-progs>
>      FileFilter  <file-ext> <filterprog>
>
> e.g.:
>      FilterDir   /usr/local/etc/httpd/sbin/filters
>      FileFilter  .pdf   pdf-filter.sh
>      FileFilter  .doc   ms-wword-filter.sh
>      FileFilter  .ps    ps-filter.sh
>      FileFilter  .gz    gzip-filter.sh
>
> e.g. pdf-filter.sh - script:
> ---
> #!/bin/sh
> # Convert file in arg1 to txt on stdout
> /usr/local/bin/pdftotext "$1" - 2>/dev/null
> ---
>
> Regards Rainer
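
For what it's worth, the other filters in Rainer's list can presumably
follow the same one-line pattern as his pdf-filter.sh. The message only
names "gzip-filter.sh", so the sketch below is my guess at its contents,
written as a function (gzip_filter, my name) so it can be tried inline:

```shell
# Sketch of a gzip filter in the same spirit as pdf-filter.sh above.
# Write the decompressed contents of the file named in $1 to stdout,
# discarding errors just as the pdf-filter.sh example does.
gzip_filter() {
    # -d: decompress, -c: write to stdout rather than replacing the file
    gzip -dc "$1" 2>/dev/null
}
```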
Received on Sun May 9 05:51:34 1999