

Filtering Documents with SWISH::Filter

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Sep 17 2002 - 21:04:47 GMT
Hi All,

We have been working on a new system for filtering documents (e.g.
converting PDF to HTML, or MS Word to text).  The new system is supposed to
provide a unified interface for filtering documents (using Perl), and
hopefully will make it easy for people to contribute new filters.

The original plan was to add the feature after 2.2 was released, but due to
limitations in the -S http spidering method, it's now part of 2.2.

Swish already has a FileFilter directive that can be used to filter
documents while indexing, but FileFilter has a few problems.  One is that
if the filter is a Perl script it can be slow, since the script is
recompiled for every document filtered.  Another is that a separate filter
must be configured for each type of document that needs converting.
Finally, FileFilter doesn't really work with the http method, because
FileFilter selects filters by file extension, not by content-type.
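
For comparison, the old per-extension approach looks roughly like this in
the swish-e config file.  The program paths and flags below are examples,
not copied from a working install -- adjust them for your system:

```
# One FileFilter rule per file extension; %p is replaced by the
# path of the file being indexed.  (Example rules -- program paths
# and quoting may differ on your install.)
FileFilter .pdf /usr/local/bin/pdftotext "'%p' -"
FileFilter .doc /usr/local/bin/catdoc    "'%p'"
```

Note there's no way to write a rule keyed on Content-Type, which is all
the http method really knows about a fetched document.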

The idea behind this new filter system is that there's a single Perl
module, SWISH::Filter, that can be called to filter any document.
Programs such as swishspider (-S http), spider.pl (-S prog), or any other
-S prog program get one interface for filtering any type of document, and
plug-in modules can be installed to add more filters as needed.
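
Here is a minimal sketch of how a -S prog program might call it.  The
method names (new, convert, was_filtered, fetch_doc, content_type) are my
reading of the SWISH::Filter interface -- treat them as assumptions and
check the module's own documentation before relying on them:

```perl
#!/usr/bin/perl
# Sketch only: assumes the SWISH::Filter modules are on PERL5LIB
# and that the method names below match the shipped interface.
use strict;
use warnings;
use SWISH::Filter;

my $filter = SWISH::Filter->new;

# Hand the module a document plus its content-type; it picks the
# right plug-in filter (or does nothing if none applies).
my $doc = $filter->convert(
    document     => 'report.pdf',        # path, or a scalar ref to content
    content_type => 'application/pdf',
    name         => 'report.pdf',
);

if ( $doc && $doc->was_filtered ) {
    my $content = $doc->fetch_doc;       # the converted (e.g. HTML) content
    print "Converted to ", $doc->content_type, "\n";
}
```

The point is that the calling program never needs to know which filter ran,
only the content-type it got back.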

For example, if you are already indexing PDF and MS Word docs using this
system and you wish to index ID3 tags from your MP3 files, you just install
the MP3 filter (and its dependent modules).  You don't need to add new
FileFilter rules or adjust your -S prog programs at all -- those files can
now be indexed.

We are looking for testers, of course.  The docs on using it are thin, but
it's already incorporated in swishspider and spider.pl, so using it is just
a matter of installing the dependencies (e.g. catdoc or xpdf) and setting
PERL5LIB to point to the location of the filters.
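
The PERL5LIB step looks something like this -- the path below is an
assumption for illustration; use whatever directory your install put the
SWISH::Filter modules in:

```shell
# Example only: point PERL5LIB at the directory containing the
# SWISH::Filter modules (this path is an assumed example).
export PERL5LIB=/usr/local/lib/swish-e/perl

# Confirm it is set before running the indexer or spider:
echo "$PERL5LIB"
```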

It's somewhat ironic that the reason this was added now was to provide
filtering when using the -S http indexing method: loading a bunch of Perl
modules for each file requested can slow down spidering, so for best
performance I recommend using -S prog with spider.pl over -S http.

I'm trying to get 2.2 out today or tomorrow, otherwise it will have to wait
until next week.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Sep 17 21:08:23 2002