
Re: A modularized view of a search engine

From: Magnus Bergman <magnus.bergman(at)not-real.observer.net>
Date: Thu Oct 09 2003 - 16:44:33 GMT
First of all, I must apologize for my ignorance. The swish-e system
already works more the way I wanted than I realized; it just uses
different solutions from the ones I expected.

On Wed, 8 Oct 2003 19:38:08 +0200
Bernhard Weisshuhn <bkw@weisshuhn.de> wrote:

> On Wed, Oct 08, 2003 at 09:53:29AM -0700, Magnus Bergman
> <magnus.bergman@observer.net> wrote:
> 
> > Searching:
> [...]
> >   Retrieve:
> >     Retrieve the contents of a document by its identifier. This is
> >     the exact same thing as in the indexing task above. It should be
> >     handled by the exact same routines. As far as I can see, swish-e
> >     does nothing beyond returning the document identifier. I think
> >     it should also support some way to create a data stream from the
> >     identifier.
> 
> You might want to check out what swish-e properties are for as opposed
> to MetaNames. You can return quite a lot of information from the
> indexed content, as long as you told swish to save those properties
> along with the index during its creation.

What I meant to say didn't come out quite right. I was only referring to
my diagram above and meant that swish-e only does the first subtask.
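
For reference, this is roughly what saving properties at index time
looks like in a swish-e configuration file (the directive names are
from the swish-e documentation; the field names `author` and `subject`
are made-up examples):

```
# Make these fields searchable by name ...
MetaNames author subject
# ... and store their values so searches can return them
PropertyNames author subject
# Store the start of the document body as a description property
StoreDescription HTML <body> 200000
```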

> I don't agree that swish-e should create the datastream as you say.
> The job of a search engine imho should be to find stuff, not to do
> anything useful with it. That should be the job of other parts of the
> framework, which *really* know how to handle that.

Now that I've checked a little more, it seems it does exactly what I
wished for. My mistake.

> Take indexing PDFs
> for example. If swish-e had 'native' support for it, we would have to
> include huge libraries (adding dependencies and bloat) to do something
> that xpdf most probably can do much better. If a new PDF revision
> needs to be supported, you update xpdf and that's that.

This is not the job of the retrieve module (as you seem to imply?) but
rather a job for the filter module (or even the view module). If you
are going to index PDF documents you need a filter to convert them into
something the indexer understands anyway (though I'm not saying this
has to be packaged together with swish-e). When displaying the document
it may or may not be suitable to use another filter, but it's still
the same task. The actual displaying of the document is definitely not
the task of the search engine; I agree with that.
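
To illustrate the filter module I have in mind: a sketch in C that
converts a PDF to text by delegating to an external tool, rather than
linking a PDF library into the indexer. The tool choice (xpdf's
pdftotext, with "-" sending the text to stdout) and the buffer size are
assumptions on my part:

```c
/* Sketch of a filter step: delegate conversion to an external tool. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Run an external converter command and stream its output to `out`.
 * Returns 0 on success, -1 on failure. */
int run_filter(const char *cmd, FILE *out)
{
    FILE *pipe = popen(cmd, "r");
    int c;

    if (pipe == NULL)
        return -1;
    while ((c = fgetc(pipe)) != EOF)
        fputc(c, out);
    return pclose(pipe) == 0 ? 0 : -1;
}

/* Convert a PDF to plain text via xpdf's pdftotext ("-" means write
 * the text to stdout, which run_filter then captures). */
int filter_pdf(const char *path, FILE *out)
{
    char cmd[1024];

    /* NOTE: a real implementation must shell-quote `path`. */
    snprintf(cmd, sizeof cmd, "pdftotext \"%s\" -", path);
    return run_filter(cmd, out);
}
```

The point is exactly the one made above: a new PDF revision means
updating pdftotext, not the indexer.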

> > [...] Some other products I use today and would want to add support
> > for in swish-e (or use swish-e in) includes:
> 
> > Gnome Virtual File System
> > GStreamer (http://www.gstreamer.org/)
> > Scrollkeeper
> > Yelp (http://www.gnome.org/softwaremap/projects/yelp/)
> 
> Please don't be offended if I don't seem to get your point.
> 
> What do you mean by support? I fail to see how these are "unsupported"
> as long as one can write a wrapper that retrieves the contents and
> converts them to xml for swish to index.

It was just me who misunderstood a few things. It seems that most of
the things I wanted to do are already fully possible. Some others (less
important ones) need a few things to be moved from the indexer to the
library, I think. I will perhaps come back to that later.

> I fear you're about to add a lot of dependencies to our compact little
> swish-e without adding much benefit. I think using Perl scripts with
> swish-e's -S prog option gives so much power (think CPAN) it should be
> pretty easy to index all these contents (and many many more).

I was thinking more along the lines of adding optional dependencies, as
in PHP for example. The result would not be bloat but a smaller
executable and library. But it seems hardly anything needs to be
changed at all in order for my ideas to be possible.

> Filtering and viewing imho should not be the responsibility of the
> indexing engine.

I agree, that wasn't what I meant. The core of my ideas is the
modularized view of the whole index/search system like this:

  Gather: Decide which documents are to be indexed. Does swish-e's
  indexer do this with a crawler, or is some Perl script used? (This
  could at least be a separate module.)

  Retrieve: Get the native contents of a document. This often involves
  downloading a file over the network, and it must be done to index the
  document. The exact same thing needs to be done again when the
  document is viewed, so the same module could be used in both cases.

  Filter: Convert data from one MIME type to another.

  Index: Index the data fed into it. This is of course a part of
  swish-e's indexer program. I also think that it is the only part that
  needs to be. Everything else can be moved out and optionally replaced
  with equivalent modules. And, well, this seems to be the case already.
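
To make the separation concrete, the four modules above could be
sketched as a set of C interfaces. None of these names come from
swish-e; this is purely an illustration of how the pieces would plug
together:

```c
#include <stdio.h>

/* Hypothetical interfaces for the four modules described above. */

/* Gather: enumerate document identifiers (URLs, paths, ...);
 * returns NULL when there are no more documents. */
typedef const char *(*gather_next_fn)(void *state);

/* Retrieve: produce a byte stream for one identifier.  The same
 * module would be reusable at view time, as argued above. */
typedef FILE *(*retrieve_fn)(const char *id);

/* Filter: convert a stream from one MIME type to another. */
typedef FILE *(*filter_fn)(FILE *in, const char *from_type,
                           const char *to_type);

/* Index: consume the filtered stream under the given identifier. */
typedef int (*index_fn)(const char *id, FILE *in);

struct pipeline {
    void          *gather_state;
    gather_next_fn gather_next;
    retrieve_fn    retrieve;
    filter_fn      filter;
    index_fn       index;
};

/* Drive the pipeline over every gathered document.  For simplicity
 * this assumes filter takes ownership of its input stream and index
 * takes ownership of the filtered stream. */
int run_pipeline(struct pipeline *p)
{
    const char *id;

    while ((id = p->gather_next(p->gather_state)) != NULL) {
        FILE *raw = p->retrieve(id);
        if (raw == NULL)
            continue;
        FILE *txt = p->filter(raw, "application/octet-stream",
                              "text/plain");
        if (txt != NULL)
            p->index(id, txt);
    }
    return 0;
}
```

Each slot could then be swapped for an equivalent module, which is the
whole point of the decomposition.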

> What we would get from that is the notion that
> swish-e 'supports' say scrollkeeper, but doesn't 'support' say
> postgres, to pick something random. This makes no sense of course. The
> proper way to index *any* content is to define exactly how to retrieve
> it, what parts to index (filtering), and to know what to do with
> search results, just like you said in your mail.
> This is exactly what the several wrappers for the engine do.
> You can include as many clues as the frontend needs for interpreting
> the data in swish-e's properties.

Are all these Perl based? If there are some wrappers that are written in
C instead, can you please direct me there?

> Or did I completely misread your mail and you actually want to supply
> those wrappers for the filters and prog-bin directories of the
> distribution? In this case of course: Excellent idea, go ahead! ;)

If I can write them in C instead of Perl, then yes, that might very
well be what I wanted to do. I've already written a few such 'wrappers'
for use in the system I'm working on, so it wouldn't be very hard to
tailor them for swish-e.
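
Such a wrapper in C boils down to emitting documents in the format the
-S prog interface expects: a header block (Path-Name, Content-Length,
optionally Last-Mtime), a blank line, then exactly Content-Length bytes
of content. A minimal sketch of the emitting side (function name and
arguments are my own, not anything from swish-e):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Write one document to `out` in the header-block format used by
 * swish-e's external-program (-S prog) interface: headers, a blank
 * line, then exactly Content-Length bytes of content. */
void emit_document(FILE *out, const char *path, const char *content)
{
    fprintf(out, "Path-Name: %s\n", path);
    fprintf(out, "Content-Length: %zu\n", strlen(content));
    fprintf(out, "Last-Mtime: %ld\n", (long)time(NULL));
    fprintf(out, "\n");
    fputs(content, out);
}
```

A wrapper's main() would loop over whatever its retrieve and filter
steps produce and call emit_document(stdout, ...) once per document;
swish-e then consumes the stream when run with -S prog.
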
Received on Thu Oct 9 17:20:02 2003