One of my few concerns regarding swish-e is that (2 words) get indexed with
pdfs, and it's my guess that they are garbage. I'm extremely interested in
obtaining your patch.
I've added a simple "enhancement" (read: kludge hack) to the swishspider
which allows you to restrict and/or require URLs in the links that the
spider indexes. It would be really nice if you could simply use a directive
instead. Currently I have a bots directory for each index that has special
restrictions.
----
John Leth-Nissen
Web Developer
Gulfstream Aerospace Corp.
Rainer Scherg wrote:
> Hi!
>
> In August last year I wrote a message in this eMail-list
> that I've done some enhancements which enable swish (1.1) to index
> non-HTML files like PDF or other document types (filter option).
>
> Since then I have occasionally received requests asking how to do this
> and where to get the source. Due to the requests I'm adapting the small
> enhancements to swish-e 1.3.2.
>
> If there is a public interest, I would try to get a small webspace
> to provide the source - instead of sending it via email on each request.
>
> ---
> To describe the changes to swish in short:
> new config directives:
> FilterDir <path-to-filter-progs>
> FileFilter <file-ext> <filterprog>
>
> e.g.:
> FilterDir /usr/local/etc/httpd/sbin/filters
> FileFilter .pdf pdf-filter.sh
> FileFilter .doc ms-wword-filter.sh
> FileFilter .ps ps-filter.sh
> FileFilter .gz gzip-filter.sh
>
> e.g. pdf-filter.sh - script:
> ---
> #!/bin/sh
> # Convert file in arg1 to txt on stdout
> /usr/local/bin/pdftotext "$1" - 2>/dev/null
> ---
>
> Regards Rainer
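
For illustration only, here is a rough sketch of what one of the other
filters named in the FileFilter list above might look like. It assumes the
same convention as the quoted pdf-filter.sh (file name in arg1, plain text
on stdout). The contents of gzip-filter.sh are not given in the message, so
the body and the function name gzip_filter are hypothetical.

```shell
#!/bin/sh
# Hypothetical gzip-filter.sh (not from the original patch).
# Assumed convention, copied from pdf-filter.sh: the file to convert is
# passed in arg1 and the resulting text is written to stdout.

gzip_filter() {
    # -d decompresses, -c writes to stdout and leaves the input file alone;
    # error messages are discarded, as in the pdf-filter.sh example.
    gzip -dc "$1" 2>/dev/null
}

# Run the filter only when a file argument is supplied.
if [ "$#" -gt 0 ]; then
    gzip_filter "$1"
fi
```

With the directives quoted above, such a script would presumably live in the
FilterDir directory and be selected by the "FileFilter .gz gzip-filter.sh"
line.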
Received on Sun May 9 05:51:34 1999