Re: A modularized view of a search engine

From: <moseley(at)>
Date: Wed Oct 08 2003 - 18:07:14 GMT
On Wed, Oct 08, 2003 at 09:53:33AM -0700, Magnus Bergman wrote:

> Indexing:
>   +--------+   +----------+   +--------+   +-------+
>   | Gather |-->| Retrieve |-->| Filter |-->| Index |
>   +--------+   +----------+   +--------+   +-------+
>   Gather:
>     Decide which documents should be indexed and generate a list of
>     them. In most cases each document is identified by a URL. But in
>     some cases other types of unique identifiers are used (for example
>     scrollkeeper, see below). This task is typically performed by a
>     spider, but other solutions are possible. I think swish-e handles
>     this in a good way.
>   Retrieve:
>     Retrieve the contents of a document by its identifier. In most cases
>     this means opening and reading a file or getting a file by HTTP. This
>     is a very common task and is not specific to search engines at all.
>     Several good general-purpose solutions to this already exist (see
>     below) and I think swish-e should be able to take advantage of them.

For HTML, those are better done together since you have to Retrieve to 
know what to Gather.
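
To illustrate why the two steps merge for HTML, here is a minimal sketch in Python (not swish-e's actual spider). The `crawl` function, the `fetch` callback, and the in-memory `SITE` dictionary standing in for a web server are all made up for this example. The point is only that the list of documents grows while documents are being fetched, because the links are inside the retrieved content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Gather and Retrieve interleaved: each fetched page feeds the queue."""
    seen = {start_url}
    queue = deque([start_url])
    docs = {}
    while queue:
        url = queue.popleft()
        body = fetch(url)              # Retrieve
        if body is None:               # hypothetical "not found" result
            continue
        docs[url] = body
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:      # Gather, only possible after Retrieve
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="b.html">B</a>',
    "http://example.test/b.html": "no links here",
}

docs = crawl("http://example.test/", SITE.get)
print(sorted(docs))
```

Passing `SITE.get` as `fetch` keeps the sketch runnable offline; a real spider would put an HTTP client there.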

>   Filter:
>     Transform the contents of a document from one mime-type to another
>     and perhaps change the encoding so the indexer can understand it.
>     Most indexers want text/plain, some also accept text/html, text/xml
>     or some more specific mime-types. This is also quite a common task;
>     it could be solved once and for all, for everybody to use.


So what you describe above is basically how swish-e works.
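
For readers who want the four stages spelled out, a rough sketch in Python follows. It is not swish-e code or its API: the stage functions, the `DOCS` dictionary, and the `doc:N` identifiers are invented for illustration, and the HTML filter is a crude tag-stripping regex rather than a real converter. Each stage only consumes what the previous one yields, which is the modular view in the diagram.

```python
import re
from collections import defaultdict

# Hypothetical document store: identifier -> (mime-type, raw content).
DOCS = {
    "doc:1": ("text/plain", "Swish-e indexes documents"),
    "doc:2": ("text/html", "<p>Filters convert <b>documents</b></p>"),
}


def gather():
    """Gather: decide which documents to index, yielding identifiers."""
    yield from DOCS


def retrieve(ids):
    """Retrieve: fetch the content behind each identifier."""
    for doc_id in ids:
        mime, body = DOCS[doc_id]
        yield doc_id, mime, body


def filter_docs(items):
    """Filter: convert everything to plain text for the indexer."""
    for doc_id, mime, body in items:
        if mime == "text/html":
            body = re.sub(r"<[^>]+>", " ", body)  # crude tag stripping
        yield doc_id, body


def index(items):
    """Index: build a word -> set of identifiers inverted index."""
    inverted = defaultdict(set)
    for doc_id, text in items:
        for word in re.findall(r"\w+", text.lower()):
            inverted[word].add(doc_id)
    return inverted


inverted = index(filter_docs(retrieve(gather())))
print(sorted(inverted["documents"]))
```

Because the stages are plain generators, any one of them could be swapped for a different implementation without touching the others.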

> GStreamer (

> Scrollkeeper
>   This is a system for keeping track of documentation. It builds a
>   database of installed documents. Each document has a unique
>   identifier, but it is independent of where the document is stored
>   and of its filename.

Sounds like a job for swish-e! ;)

Bill Moseley
Received on Wed Oct 8 18:11:22 2003