
Re: Filters/HTTP (was:Documentation structure)

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Wed Dec 13 2000 - 08:34:54 GMT
> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Wednesday, December 13, 2000 1:27 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Filters/HTTP (was:Documentation structure)
> 

> 
> I've mentioned this before, but I'm not sure how integrated the HTTP
> method should be in swish.  I'm not saying that there shouldn't be a
> way to spider documents, but rather that maybe there should be a
> modular approach to the way the HTTP method is connected to swish.
[...]

Agreed...

We should have two methods in the future (IMO):

  - internal HTTP spidering in swish itself.
  - external feeding of documents into swish (sketched below).


The best program I know for spidering is "wget" (as you also mentioned).

Including parts of "wget" as an internal spider in swish would provide
all the means we need:
  - spider level (depth) control
  - host span options
  - domain control
  - etc.
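
Until something like that is built in, the external-feeding variant can
be approximated today: mirror the site with wget and index the local
copy. A rough sketch (the host name, directory, and config file are
made-up examples; the wget options correspond to the controls listed
above):

    #!/bin/sh
    # Mirror the site into ./mirror:
    #   -r -l 3 : recursive, with spider level (depth) control
    #   -H -D.. : host span and domain control
    wget -q -r -l 3 -H -D example.com -P mirror http://www.example.com/
    # Index the local copy with the normal file-system method.
    swish-e -c swish.conf -i mirror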




> 
> Now about filters.  Again, I don't use filters, but the current system
> looks like you define a file extension and a program that swish calls.
> 
>       FilterDir   /usr/local/apache/swish-e/filters-bin/
>       FileFilter  .pdf   pdf-filter.sh
> 
> pdf-filter.sh will get passed the name of the file to filter.  
> 
> I'm unclear if you can use filters in http mode.  The documentation
> indicates that a URL is passed, which would mean that the filter
> would also need to retrieve the remote document first -- a process
> that isn't really related to filtering.

Not quite correct.

The following parameters are passed to a filter script:
   - the path of the (local) work file to index
   - the real path or URL

In the case of file indexing, "file path" and "real path" are the same.

Passing the real path/URL is just for information or special purposes;
it is mostly not used by the filter program.
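
For illustration, a minimal pdf-filter.sh along those lines (assuming
xpdf's pdftotext is installed; the variable names are mine, the two
arguments are as described above):

    #!/bin/sh
    # $1 = path of the local work file to index
    # $2 = real path or URL (informational only; unused here)
    WORKFILE="$1"
    # Write the extracted text to stdout for swish to index.
    exec pdftotext "$WORKFILE" -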

 
> Anyway, with the current system swish must fork and exec /bin/sh -c
> for each document.  Forking isn't that expensive in modern operating
> systems, but it still seems like it would be slower than just opening
> up the filter program once and feeding it the documents one after
> another, leaving the filter program running in memory.
[...]


Yep, that's correct, but it also has some disadvantages:

  - You have to implement a communication protocol between swish and
    the filters: both sides have to know when a new document starts
    and which document it is (see the sketch below).

  - You cannot use simple scripts any more.

  - You have to define a multi-filter protocol
    (there are more filters than just PDF filters).

  - You still have to fork/exec the actual filter programs (xpdf,
    ghostscript, catdoc, or whatever).
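
To make the first point concrete: a persistent filter would have to
speak some protocol like the following sketch (the end-of-document
marker and the one-pair-per-line convention are invented here, not an
existing swish interface):

    #!/bin/sh
    # Long-running filter: read one "workfile realpath" pair per line,
    # write the converted text back, then an end-of-document marker so
    # swish knows where one document stops and the next begins.
    while read WORKFILE REALPATH; do
        pdftotext "$WORKFILE" -
        echo "__END_OF_DOCUMENT__"
    done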


With the HTTP method, IMO most of the time is spent retrieving the
documents from the net. Following your proposal, it would make sense to
have a multithreaded swish engine with an HTTP read-ahead. But that
would mean a major redesign of swish.
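
Even without threads, the effect can be approximated at the process
level. A rough sketch (the URL lists, spool directories, and config
file are invented names): fetch the next batch in the background while
the previous batch is filtered, then index everything in one run:

    #!/bin/sh
    # batch1.txt / batch2.txt hold URL lists; spool1/spool2 are scratch dirs.
    wget -q -i batch1.txt -P spool1
    wget -q -i batch2.txt -P spool2 &    # "read-ahead": next batch downloads...
    for f in spool1/*.pdf; do            # ...while this batch is filtered
        pdftotext "$f" "${f%.pdf}.txt"
    done
    wait
    for f in spool2/*.pdf; do
        pdftotext "$f" "${f%.pdf}.txt"
    done
    swish-e -c swish.conf -i spool1 spool2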

But we should keep the proposal in mind.

cu - rainer


Received on Wed Dec 13 08:37:58 2000