Re: Perl API and mod_perl/Incremental

From: Tac/Smokescreen <tac(at)not-real.smokescreen.org>
Date: Thu Feb 17 2005 - 15:33:23 GMT
Why does searching write into the index structures?  Is it required?  Does 
it cache the most recent searches in some way?

Tac
----- Original Message ----- 
From: "Bill Moseley" <moseley@hank.org>
To: "Multiple recipients of list" <swish-e@sunsite3.berkeley.edu>
Sent: Thursday, February 17, 2005 10:23 AM
Subject: [SWISH-E] Re: Perl API and mod_perl/Incremental


> On Thu, Feb 17, 2005 at 02:37:08AM -0800, Markus Peter wrote:
>> Can I already open the index files in my Apache mod_perl startup script
>> (=before the fork of the children) and it will automatically do the right 
>> thing
>
> I'm not sure.  Searching writes into the index structures, so you are
> going to get a copy of the memory anyway (copy-on-write).  Using a
> second mod_perl server with SWISHED (as Peter commented about)  might
> be a bit more efficient memory wise since there would be fewer child
> processes running swish.  If that's worth the trade-off of running a
> second mod_perl server is something you would have to determine.
>
> The act of opening the index doesn't use that much RAM.  Running
> searches can, though.
>
> Try opening the indexes in startup.pl, then in each child after the
> fork, and report back the differences and how you measured them.
>
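
One possible way to take that measurement on Linux is to compare a child's Private_Dirty total (from /proc/self/smaps) before and after it writes to memory inherited from the parent. The sketch below is Python rather than Perl and does not touch the Swish-e API at all: a plain byte buffer stands in for the index structures, and the function names are made up for illustration.

```python
import mmap
import os


def sum_smaps_kb(smaps_text, field):
    """Sum every '<field>: N kB' line found in the text of /proc/<pid>/smaps."""
    prefix = field + ":"
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


def private_dirty_kb():
    """Private_Dirty of the current process in kB (Linux only)."""
    with open("/proc/self/smaps") as fh:
        return sum_smaps_kb(fh.read(), "Private_Dirty")


def child_growth_kb(buf, touch):
    """Fork, optionally write to every page of buf in the child, and
    return how much the child's Private_Dirty grew while doing so."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: buf is shared with the parent until a page is written to.
        os.close(read_end)
        before = private_dirty_kb()
        if touch:
            for offset in range(0, len(buf), mmap.PAGESIZE):
                buf[offset] = 1  # one write per page forces a private copy
        after = private_dirty_kb()
        os.write(write_end, str(after - before).encode())
        os._exit(0)
    # Parent: collect the child's measurement and reap it.
    os.close(write_end)
    data = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return int(data)
```

Calling child_growth_kb(buf, touch=False) and child_growth_kb(buf, touch=True) with something like buf = bytearray(b"x" * (64 * 1024 * 1024)) should show the child's private memory growing by roughly the buffer size only when it writes. If searching really does write into the index structures, opening the index before the fork would be expected to behave like the touch=True case once a child has run a few searches.
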
>> The other question I have is regarding incremental mode. So far I've
>> been using the traditional mode with cron jobs to update once or twice a
>> day, but I'd really like to convert the search to be "real time". How
>> stable is incremental mode? And "how incremental" is it? Can I use it,
>> to add/modify/remove documents from the search index on the fly, as they
>> are added/modified or is it rather targeted at batch processing a larger
>> number of updates (=merely a better merge)?
>
> I only know a little about incremental internals.
>
> It's not really on-the-fly.  It uses a different index format -- a
> btree structure that allows updates.  Deletions are made by marking
> the file as having zero words in total; the word data itself isn't
> deleted.  So the index continues to grow.  It also means that the
> search engine still finds words from deleted files; each matching
> file is then checked to see if it has been deleted, and if so, it is
> not added to the result set.
>
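
The deletion scheme described above (mark the file as having zero words, leave the word data in place, filter at search time) can be sketched with a toy in-memory index. This is only an illustration of the idea in Python; the class and method names are invented here and say nothing about the real btree format.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index with tombstone-style deletes (illustration only)."""

    def __init__(self):
        self.postings = defaultdict(list)  # word -> list of file ids
        self.word_count = {}               # file id -> total words; 0 = deleted

    def add(self, file_id, text):
        words = text.lower().split()
        self.word_count[file_id] = len(words)
        for word in set(words):
            self.postings[word].append(file_id)

    def delete(self, file_id):
        # Only the per-file record changes; the postings are left alone.
        self.word_count[file_id] = 0

    def search(self, word):
        hits = self.postings.get(word.lower(), [])
        # Deleted files are still found here and filtered out afterwards.
        return [fid for fid in hits if self.word_count[fid] > 0]

    def posting_entries(self):
        return sum(len(ids) for ids in self.postings.values())


index = ToyIndex()
index.add(1, "swish search engine")
index.add(2, "search index")
entries_before = index.posting_entries()
index.delete(1)
result = index.search("search")            # file 1 is found, then dropped
entries_after = index.posting_entries()    # unchanged by the delete
```

In this sketch the delete is cheap, but the postings never shrink and every search pays for the extra filtering, which matches the "index continues to grow" behaviour described above.
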
> It's not really on-the-fly because, although you can add files to an
> existing index, the final stages of indexing are still done every
> time a file is added -- namely the presorted indexes have to be
> rebuilt.  I'm not 100% sure, but I suspect there's a time when the
> index is in an unstable state while adding files to the index.
>
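
To see why a presorted index makes single-file adds expensive, here is a rough Python sketch. It assumes, purely for illustration, that a presorted index is a table mapping each file to its rank under some property (title, in this example); the message above does not describe the actual on-disk layout, and build_presorted is a made-up name.

```python
def build_presorted(props):
    """props maps file id -> property value; returns file id -> sort rank."""
    ordered = sorted(props, key=lambda fid: (props[fid], fid))
    return {fid: rank for rank, fid in enumerate(ordered)}


titles = {1: "mod_perl", 2: "swish-e", 3: "apache"}
ranks_before = build_presorted(titles)

# Adding one file can shift the rank of every file that sorts after it,
# so the whole table is rebuilt rather than patched in place.
titles[4] = "incremental"
ranks_after = build_presorted(titles)

# At search time, hits are ordered by looking up ranks, with no string compares.
hits = [2, 4, 3]
sorted_hits = sorted(hits, key=ranks_after.__getitem__)
```

The lookup at search time is fast, but the rebuild touches every file, and a reader that sees a half-written table would get wrong orderings, which is one plausible source of the unstable window suspected above.
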
> I tried a commercial search engine once -- I can't remember what it
> was (they kept emailing me for months after the "free trial" so you
> would think I would remember) -- but it truly allowed searches while
> it was indexing.  The downside was it took  f o r e v e r   to run
> indexing, and searches were not that speedy.  Yes, I suspect that was
> a trade-off for scalability.
>
>
> -- 
> Bill Moseley
> moseley@hank.org
>
Received on Thu Feb 17 07:33:23 2005