
Re: A modularized view of a search engine

From: Magnus Bergman <magnus.bergman(at)>
Date: Thu Oct 16 2003 - 15:10:26 GMT
On Thu, 9 Oct 2003 10:32:36 -0700 (PDT) wrote:

> On Thu, Oct 09, 2003 at 05:00:44PM +0200, Magnus Bergman wrote:
> > The main point: the job of retrieving a (fixed size, linear)
> > document only needs to be implemented once for the whole system.
> > Each and every program that needs this functionality can use the
> > same code.
> Good point.  Swish uses Perl's LWP code everywhere when it needs to
> grab a URL.  No reinventing the wheel there.

I was thinking in a wider perspective. What I do is integrate quite a
few different packages, including swish-e, into a search system. They
all need to fetch documents from at least one non-standard URL type
(since some commercial packages depend on it). And it would be nice if
every program could use the same plug-in for this. Gnome VFS is the best
solution I have found so far. Maybe I'll try to make LWP make use of
it (after I've found out what LWP is).
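To make the idea concrete, here is a minimal sketch (in Python, with entirely hypothetical names; this is not Gnome VFS or LWP code) of the kind of scheme-based fetcher registry that would let every program retrieve documents through the same plug-in:

```python
# Hypothetical sketch of a pluggable document fetcher: each URL scheme
# (including non-standard ones) maps to one handler shared by every program.
from urllib.parse import urlparse

_handlers = {}  # scheme -> fetch function

def register_scheme(scheme, handler):
    """Plug in a handler for one URL scheme (e.g. a commercial one)."""
    _handlers[scheme] = handler

def fetch(url):
    """Retrieve a (fixed size, linear) document, whatever its scheme."""
    scheme = urlparse(url).scheme
    try:
        return _handlers[scheme](url)
    except KeyError:
        raise ValueError("no fetcher registered for scheme %r" % scheme)

# Toy handler for a made-up non-standard scheme, registered once and then
# usable by every program in the system.
register_scheme("docbase", lambda url: b"contents of " + url.encode())
```

The point of the sketch is only that the retrieval code lives in one place; each crawler, filter, or indexer calls `fetch()` without knowing which plug-in serves the scheme.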

> > I must admit that I haven't looked much at SWISH::Filter since I
> > don't know Perl. Can it easily be used on the command line to
> > convert documents? And can other command line filters easily be used
> > with swish-e? (By easy I mean without writing any Perl code.)
> Just because you don't know Perl doesn't mean it isn't easy.

True. But Perl and I don't seem to get along. I will perhaps try it out
more anyway.

> Can SWISH::Filter be used at the command line?  Yes (well two ways:
> one is that you can run Perl from the command line, but the other is
> with the swish-filter-test program that is a simple wrapper for 
> SWISH::Filter).
> So you want:
>    url_list | url_fetch | swish-e

Yes, but more like this:

some crawler    \                                    / some indexer
another crawler  > | one fetcher | one filter set | <  another indexer
url-lists       /                                    \ swish-e
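As a rough illustration of that fan-in/fan-out shape (all names below are hypothetical stand-ins, not real swish-e or crawler code), the shared fetcher and filter set would sit between any number of sources and any number of indexers:

```python
# Hypothetical sketch: many crawlers/url-lists feed one fetcher and one
# filter set, whose output can be handed to any indexer (swish-e or other).
def merge_sources(*sources):
    """Fan-in: combine URL streams from crawlers and url-lists."""
    for source in sources:
        yield from source

def fetch(url):
    # Stand-in for the single shared fetcher (e.g. something VFS-based).
    return "raw:" + url

def filter_set(doc):
    # Stand-in for the single shared filter set (format -> indexable text).
    return doc.upper()

def pipeline(sources, indexers):
    """Fan-out: deliver each filtered document to every indexer sink."""
    for url in merge_sources(*sources):
        doc = filter_set(fetch(url))
        for index in indexers:
            index(doc)

indexed = []
pipeline([["a.html"], ["b.html"]], [indexed.append])
# indexed is now ["RAW:A.HTML", "RAW:B.HTML"]
```

Each box in the diagram is one replaceable function here; the fetcher and filter set are written once, however many sources and sinks are attached.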

> But that's not going to be very general purpose.  How are you going to
> get the content-type or other HTTP header data to swish-e?  You can do
> this with swish-e:

Gnome VFS makes use of the MIME type (in a nice general-purpose manner).
All other header data will get lost with my solution. What else is there
that is useful for indexing? Do I need it?

> | swish-e -S prog -i stdin -c some.config
> How would you modify that?

I don't really want to stop that from working; I just want more
flexibility. To make the following possible, for example:

gst-launch crawler-src ! spider ! swish-sink

This creates a GStreamer pipeline with three elements (actually it will
end up with more, but those are included automatically):

crawler-src does what a crawler does: it feeds spider with document
streams and reports their MIME types. (This is a hypothetical element,
but it could be created.) spider is an existing element (shipped with
GStreamer) that looks at the MIME types the crawler provides and the
swish-sink accepts, and automagically figures out which conversions are
needed (loading other elements that do the job). The swish-sink then
just indexes what comes in.
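The automatic-conversion step can be sketched in plain Python (hypothetical converter names; this only mimics what an autoplugger like spider does, it is not GStreamer code):

```python
# Hypothetical sketch of spider-style autoplugging: given the MIME type a
# source produces and the types a sink accepts, pick the converters to
# insert between them. The table is a toy, not real GStreamer elements.
converters = {
    ("application/pdf", "text/plain"): "pdf2text",
    ("text/html", "text/plain"): "html2text",
}

def autoplug(src_type, sink_types):
    """Return the elements to insert between source and sink ([] if none)."""
    if src_type in sink_types:
        return []  # direct link, no conversion needed
    for sink_type in sink_types:
        if (src_type, sink_type) in converters:
            return [converters[(src_type, sink_type)]]
    raise ValueError("cannot link %s to this sink" % src_type)
```

So when crawler-src announces `application/pdf` and swish-sink accepts only `text/plain`, a `pdf2text`-style element would be inserted automatically; a plain-text document would be linked straight through.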

Another possibility could be to link the indexer directly to the
crawler, if someone wants to do that.

I'm currently not using anything but the swish-e indexer and the library
(I find the rest redundant). The library works exactly like I want it
to (I think). The indexer is not completely tailored to my needs. I
think it would be nice if the indexer could be put into a library too,
with only an API for feeding document streams into it.

> Swish-e is no where near perfect, so I look forward to your input
> after you familiarize yourself with the docs.  The docs are not
> perfect either, so input there is helpful, too.

Honestly, I have no intention of learning Perl just for this. But if
there is something else I could help with, I will consider it. Is there,
for example, any interest in taking advantage of Gnome VFS or GStreamer?
Received on Thu Oct 16 15:10:38 2003