On Wed, Oct 08, 2003 at 09:53:33AM -0700, Magnus Bergman wrote:
> Indexing:
> +--------+   +----------+   +--------+   +-------+
> | Gather |-->| Retrieve |-->| Filter |-->| Index |
> +--------+   +----------+   +--------+   +-------+
> Gather:
> Decide which documents should be indexed and generate a list of
> them. In most cases each document is identified by a URL. But in
> some cases other types of unique identifiers are used (for example
> scrollkeeper, see below). This task is typically performed by a
> spider, but other solutions are possible. I think swish-e handles
> this in a good way.
>
> Retrieve:
> Retrieve the contents of a document by its identifier. In most cases
> this means opening and reading a file, or getting a file by HTTP. This
> is a very common task and is not specific to search engines at all.
> There exist several good general-purpose solutions to this already (see
> below) and I think swish-e should be able to take advantage of them.
For HTML, those are better done together since you have to Retrieve to
know what to Gather.
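
To illustrate why the two steps end up interleaved for HTML, here is a minimal sketch of a breadth-first spider in Python. It is not swish-e's spider; the names `LinkCollector`, `crawl`, `fetch` and the in-memory `SITE` are hypothetical, and `fetch` stands in for whatever actually retrieves a URL. The point is only that new identifiers to gather come out of documents that have already been retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure.

    Returns the URLs that were successfully retrieved, in visit order.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html_text = fetch(url)            # Retrieve
        if html_text is None:
            continue
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html_text)
        for href in parser.links:         # Gather, using what was retrieved
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical in-memory "site" so the sketch runs without a network.
SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}

if __name__ == "__main__":
    for url in crawl("http://example.test/", SITE.get):
        print(url)
```

For a fixed list of files on disk the two steps separate cleanly; it is only link-following that forces them together.
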
> Filter:
> Transform the contents of a document from one mime-type to another
> and perhaps change the encoding so the indexer can understand it.
> Most indexers want text/plain; some also accept text/html, text/xml
> or some more specific mime-types. This is also quite a common task;
> it could be solved once and for all, for everybody to use.
SWISH::Filter?
So what you describe above is basically how swish-e works.
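
As a rough picture of the Filter step described above, here is a small Python sketch that dispatches on mime-type and hands the indexer plain text. It is not the SWISH::Filter interface; `FILTERS`, `filter_document` and the two converters are hypothetical names, and the HTML handling is a crude regex-based tag stripper that only serves the illustration.

```python
import html
import re


def html_to_text(data, encoding):
    """Crude HTML-to-text conversion: drop scripts, styles and tags."""
    text = data.decode(encoding, errors="replace")
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", text)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Resolve entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(text).split())


def plain_to_text(data, encoding):
    """Decode plain text and normalize whitespace."""
    return " ".join(data.decode(encoding, errors="replace").split())


# Hypothetical registry: mime-type -> converter producing text/plain.
FILTERS = {
    "text/html": html_to_text,
    "text/plain": plain_to_text,
}


def filter_document(data, mime_type, encoding="utf-8"):
    """Return plain text for the indexer, or None if no filter applies."""
    convert = FILTERS.get(mime_type)
    if convert is None:
        return None
    return convert(data, encoding)


if __name__ == "__main__":
    page = b"<html><body><h1>Hi</h1><p>caf&eacute; &amp; tea</p></body></html>"
    print(filter_document(page, "text/html"))
```

Supporting another format then means adding one entry to the table, which is the "solved once, for everybody to use" idea in miniature.
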
> GStreamer (http://www.gstreamer.org/)
http://www.gstreamer.net/
> Scrollkeeper
> This is a system for keeping track of documentation. It builds a
> database of installed documents. Each document has a unique
> identifier, but it is independent of where the document is
> stored and its filename.
Sounds like a job for swish-e! ;)
--
Bill Moseley
moseley@hank.org
Received on Wed Oct 8 18:11:22 2003