Incr. update of site and missing index files.

From: Douglas Smith <douglas(at)not-real.SLAC.Stanford.EDU>
Date: Tue Apr 15 2003 - 22:28:05 GMT
Oops, I have been forgetting to change the subject on these
threads.

On Tuesday 15 April 2003 02:17 pm, Bill Moseley wrote:
> > 5. To update the site quickly I created a second inc. update
> > index.  This also gets merged into the larger index in a
> > periodic manner, and after this merge the index doesn't exist
> > until the next update.  This would produce an error with your
> > current swish.cgi and swish-e search, since I am asking for
> > an index which doesn't exist.  I changed the swish.cgi such
> > that it accepts multiple indexes for searching, but checks
> > on their existence before passing the list onto the swish-e
> > executable, so there is no error.  Is this a feature which
> > perhaps should be added to the swish.cgi or even swish-e, if
> > there are multiple index for search, ignore one if it is not
> > there instead of producing an error?
>
> There is code now to add files to an index.
> It has not been tested much, but would work good for archives.  Grab cvs
> or a swish-daily and run ./configure --help
>
> So you have something like:
>
>    full_index + incremental_index
>
> and once in a while you merge the incremental_index into the full_index
> and then incremental_index no longer exists.  Is that the problem?
> I guess I'd stat the file, too.  Or maybe create a dummy index for the
> incremental_index with a small file entry with one word.  Stat()ing is
> probably faster.

Well, the full index of 150,000 pages turns out to be two files
each 200MB in size, and takes hours to produce.  I create a daily 
index (which takes a few seconds so that is fine) and then merge 
at the end of the day.  

During the day I re-create the incremental index every 15 minutes,
and people are very happy with this.  I ask swish-e to search
the two indexes, full + incr.  But after the merge, the
incr. index doesn't exist for up to 15 minutes, so swish-e will
produce an error when trying to search the two indexes.

I could change the config file for those 15 minutes each day, but
I don't really want to do that.  So, I put a test for the
existence of the index file into swish.cgi.  It tests each
index asked for, and pushes it onto a second array if it exists.
It then feeds the second array of index files to swish-e, so there
is no error.  Now the config files don't change, and the search
always works.

So, yes, this is just a stat on the index file to see if 
it exists.
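
For illustration only, here is a minimal sketch of that check. It is
written in Python rather than the Perl of swish.cgi, and it is not the
actual change I made; the function names and the index paths are made
up for the example. It assumes swish-e is called with -w for the
search words and -f for the list of index files.

```python
import os


def existing_indexes(index_paths):
    """Return only the index paths that currently exist as regular files."""
    return [path for path in index_paths if os.path.isfile(path)]


def build_search_command(query, index_paths, swish_binary="swish-e"):
    """Build a swish-e argument list, skipping indexes that are missing.

    The command is only built here, not executed.
    """
    available = existing_indexes(index_paths)
    if not available:
        # Nothing to search at all: report it instead of calling swish-e.
        raise FileNotFoundError("no index files found: %r" % (index_paths,))
    # -w gives the search words, -f the list of index files to search.
    return [swish_binary, "-w", query, "-f"] + available


if __name__ == "__main__":
    # Hypothetical paths; the incremental index may be absent after a merge.
    print(build_search_command("detector", ["full.index", "incr.index"]))
```

The configured list of indexes stays the same all day; only the list
handed to swish-e shrinks while the incr. index is missing.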

I worry that adding single files to the full index might take
too long.  Right now re-making the incr. index takes a couple
of seconds.  If the full index is 200MB, will it take less
than a couple of seconds to add a page to this index?

Is this "missing incr. index after merge" not a problem for
other sites?

Douglas

-- 
-----------------------------------------------------------
Douglas A. Smith                  douglas@slac.stanford.edu
Office: Bld 280, Rm 157                       (650)926-2369
-----------------------------------------------------------
Received on Tue Apr 15 22:29:00 2003