Re: Filters/HTTP (was:Documentation structure)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 13 2000 - 21:39:52 GMT
At 09:33 AM 12/13/00 +0100, Rainer.Scherg@rexroth.de wrote:
>> I'm unclear if you can use filters in http mode.  The documentation
>> indicates that a URL is passed, which would mean that the filter would
>> also need to retrieve the remote document first -- a process that isn't
>> really related to filtering.
>
>not quite correct.
>
>The following parameters are passed to a filter script:
>   - file path to index
>   - real path or url
>
>In case of file indexing "file path" and "real path" are the same.
>
>Passing the real path/URL is just for information or special purpose, mostly
>not used by the filter program.

So does that mean you can't currently use filters in the httpd access method?

>... but has also some disadvantages:
>
>  - you have to implement a communication protocol between swish
>    and filters.
>    Swish and the filter have to know when a new document starts and
>    which document it is.

True.  I don't see that as a big issue.  You just define a protocol.  The
first line/record gives the number of lines in the header, the second
record is the content-length or some such thing.  Or to make it
complicated, there's SOAP ;)
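
A minimal sketch of that kind of framing, in Python.  Nothing here is an
existing swish interface: the function names and the Content-Length and
Path-Name header names are made up for illustration.  Each document is sent
as a header-line count, the header lines, and then exactly Content-Length
bytes of body, so the reader always knows where the next document starts.

```python
import io


def write_document(stream, path, body):
    """Frame one document: header-line count, header lines, raw body."""
    # Header names are hypothetical; any agreed-upon set would do.
    headers = [
        f"Content-Length: {len(body)}",
        f"Path-Name: {path}",
    ]
    stream.write(f"{len(headers)}\n".encode("ascii"))
    for line in headers:
        stream.write((line + "\n").encode("utf-8"))
    stream.write(body)


def read_documents(stream):
    """Yield (headers, body) pairs until the stream is exhausted."""
    while True:
        first = stream.readline()
        if not first:
            return  # clean end of input between documents
        count = int(first.decode("ascii").strip())
        headers = {}
        for _ in range(count):
            line = stream.readline().decode("utf-8").rstrip("\n")
            name, _sep, value = line.partition(": ")
            headers[name] = value
        length = int(headers["Content-Length"])
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("truncated document body")
        yield headers, body


if __name__ == "__main__":
    buf = io.BytesIO()
    write_document(buf, "/docs/a.html", b"<p>hello</p>\n")
    buf.seek(0)
    for headers, body in read_documents(buf):
        print(headers["Path-Name"], len(body))
```

Because the body length is stated up front, the body can contain newlines
or binary data without confusing the reader.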

>  - you cannot use simple scripts.

For backward compatibility, I'd expect to leave in the existing filter
system.  Just like Apache where you can write CGI scripts that are slow, but
if you need to process a lot of requests then it makes sense to spend the
time to write an Apache module.  So write simple filters when your site is
small, then write the filter pipe/server programs when indexing time is an
issue.

>  - you have to install a multi filter protocol.
>    (there are more than pdf-filters).

Of course.  Since filters should be able to work with both httpd and file
access methods then it might make sense to use content-type (and maintain a
mime.types file for local files).  From my examples:

  DocumentFilter text/html /usr/local/bin/htmlstrip
  DocumentFilter .gz /usr/local/bin/expand_gz.pl

If you specify a content-type then it uses the content-type as returned by
httpd or the content-type mapped from the mime.types file for local files.
If you specify an extension then it uses that (with a warning about doing
that with httpd).
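
A rough sketch of how that lookup could work, in Python.  This is not how
swish does it: parse_filters and choose_filter are hypothetical names, and
Python's mimetypes module stands in for the mime.types file for local
files.

```python
import mimetypes


def parse_filters(config_text):
    """Collect (key, program) pairs from DocumentFilter lines."""
    rules = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "DocumentFilter":
            rules.append((parts[1], parts[2]))
    return rules


def choose_filter(rules, path, content_type=None):
    """Return the filter program for a document, or None if no rule matches.

    content_type is what the web server reported.  For local files it is
    None and the type is guessed from the file name instead.
    """
    if content_type is None:
        content_type, _encoding = mimetypes.guess_type(path)
    else:
        # Drop parameters such as "; charset=..." before comparing.
        content_type = content_type.split(";")[0].strip().lower()
    for key, program in rules:
        if key.startswith("."):
            # Extension rule: match on the end of the path.
            if path.lower().endswith(key):
                return program
        elif key == content_type:
            return program
    return None


CONFIG = """\
DocumentFilter text/html /usr/local/bin/htmlstrip
DocumentFilter .gz /usr/local/bin/expand_gz.pl
"""

if __name__ == "__main__":
    rules = parse_filters(CONFIG)
    print(choose_filter(rules, "/var/www/index.html"))
    print(choose_filter(rules, "http://example.com/page", "text/html"))
```

The sketch leaves out the warning for extension rules used with httpd; a
URL does not have to end in anything meaningful, which is why matching on
the content-type is the safer choice there.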

>  - you have still to fork/exec the filter programs (xpdf, ghostscript,
>    catdoc, or whatever).

Well, of course it would only make sense to write this type of filter if
you were using something like Compress::Zlib or other internal library in
your filter so you don't have to fork for every request.
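
To illustrate the difference, here is a small Python sketch in which
Python's gzip module plays the role that Compress::Zlib would play in a
Perl filter.  The function name expand_all is made up.  One long-running
process expands every document itself, so no child process is started per
document.

```python
import gzip


def expand_all(documents):
    """Decompress each gzip payload in-process, one document at a time.

    documents is an iterable of (name, compressed_bytes) pairs.  No
    external program is run, so there is no fork/exec per document.
    """
    for name, payload in documents:
        yield name, gzip.decompress(payload)


if __name__ == "__main__":
    docs = [
        ("a.txt.gz", gzip.compress(b"first document")),
        ("b.txt.gz", gzip.compress(b"second document")),
    ]
    for name, text in expand_all(docs):
        print(name, len(text))
```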

>On httpd method IMO most time is spent retrieving the documents from the
>net.
>Following your proposal, it would make sense to have a multithreaded swish
>engine, with a httpd-read-ahead. But this would mean a major redesign of
>swish.

Exactly.  That's exactly why I think httpd should be moved out of swish --
or at least provide an interface to an external document source provider.
An external spider could use all sorts of tricks to feed documents to swish
fast.  And those methods would be site dependent.  With the file access
method I'm not sure I understand how multithreading would help if running on
a single processor system -- I assume the file system can feed files to
swish as fast as swish can index them.  So if you wanted to index remote
documents faster you would write a faster spider.

Of course, if you are spidering/filtering someone else's site you probably
don't want to hit them too hard and fast so the forking issue is moot.  And
if you are spidering your own site then the file access method might be
better if possible, or use your own document source provider or filter if
you must convert the source on the fly.

The only point, of course, is performance and scalability.  If filters are
an important feature of swish then there should be a method of using them
without much of a performance hit.  Don't want to give anyone a reason
not to use swish ;)

>But we should keep the proposal in mind.

Sounds fair to me.


Bill Moseley
mailto:moseley@hank.org
Received on Wed Dec 13 21:42:34 2000