Re: Incremental indexing?

From: Keith Thompson <kjt(at)>
Date: Sat Feb 23 2002 - 03:14:59 GMT

Thanks for the quick reply.

>I'd stay away from merge.

Can you give me some specifics?  Is it just plain broken,
or does it have specific known problems?

>Have you looked at other search engines that are based on something like
>Berkeley DB?  That might solve your problem of new documents coming in fast.

I've looked at others.  I just find swish-e to be a little more
flexible: I can completely hack it up if necessary, and despite
being a UNIX lifer I'm forced to use something that can live in
a Microsoft world with the least amount of pain.  Plus, swish-e
is darned fast and has some excellent search features.  I like it,
and the price tag doesn't suck.

>If you were to store the documents compressed in something like MySQL as
>they come in then it makes managing a system like that a bit easier since
>you can timestamp things and not have to worry about duplicate files.  And
>then you always have copies of the original documents.

I've considered this, but we're talking about enough potential data that
storing the files is prohibitive.  Plus, in some cases the content
of the files is such that I'm not supposed to be keeping them, for
security reasons.  So, as much as something like this would help,
I [unfortunately] don't want to consider it unless necessary.

>The problem, of course, is swish creates a reverse index: words point to
>files not the other way around.  So currently there's no way to say what
>words belong to a given file.

That's what I thought from looking for a short time at the source.
I was hoping you'd say otherwise.  :)
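
To make the limitation concrete, here is a minimal sketch of a reverse (inverted) index in Python. It is not swish-e's actual data structure or API; the function names and the sample file names are made up for illustration. It only shows why "which words belong to this file?" is awkward when words point to files: the only way to answer is to scan every word's file list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def words_for_file(index, name):
    """Recover a file's words the slow way: scan every word's file list."""
    return {word for word, files in index.items() if name in files}


# Hypothetical sample documents, for illustration only.
docs = {"a.txt": "fast search engine", "b.txt": "fast index"}
index = build_inverted_index(docs)
```

Looking up a word is a single dictionary access, but `words_for_file` has to touch the whole index, which is what makes deleting or updating one file expensive.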

>Swish uses its own database.  But swish can be compiled to use Berkeley
>DB[1].  It's a lot slower indexing, but that might give a path to
>incremental indexing for both adding records and for deleting records (with
>an additional table, like I said above, to track words for each file).

Speed here is not nearly as crucial an issue as maintaining
an index with integrity and "ease".  I'll look into this.
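
As I understand the suggestion, the additional table would map each file to its words, so that a file can be removed or re-indexed without scanning the whole index. A rough sketch of that idea in Python follows. This uses in-memory dicts rather than Berkeley DB, and the class and method names are hypothetical, not anything from swish-e.

```python
from collections import defaultdict


class IncrementalIndex:
    """Inverted index plus a per-file word table (illustrative sketch only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> files containing it
        self.file_words = {}              # file -> its words (the extra table)

    def add(self, name, text):
        """Index a file; re-adding an existing file replaces its old entry."""
        if name in self.file_words:
            self.remove(name)
        words = set(text.lower().split())
        for word in words:
            self.postings[word].add(name)
        self.file_words[name] = words

    def remove(self, name):
        """Delete a file by visiting only the words recorded for it."""
        for word in self.file_words.pop(name, set()):
            files = self.postings[word]
            files.discard(name)
            if not files:
                del self.postings[word]  # drop words no file uses any more

    def search(self, word):
        """Return a copy of the set of files containing the word."""
        return set(self.postings.get(word.lower(), set()))
```

The cost is the extra storage for `file_words` and the slower indexing mentioned above; in exchange, add, update, and delete each touch only the words of the one file involved.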

Thanks -keith
Received on Sat Feb 23 03:15:30 2002