
Re: AutoSwish

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Feb 24 2001 - 15:54:28 GMT
At 03:56 AM 02/24/01 -0800, Sheni R. Meledath wrote:
>I have created an index file using SWISH-E. I want to update the index
>file. I don't want to re-index the whole site. Only modified files should
>be checked and added to the index. The problem I am now facing with
>reindexing the whole site is a memory problem. After indexing many files it
>displays an error message: "Ran out of memory (Could not allocate enough)".
>Can anybody help me to solve this situation?

Which version of swish are you running?  2.0 uses more memory since it
keeps track of word positions during indexing.  I can't remember exactly
when it was added, but the development version of swish has the -e
(economy) option, which saves memory during indexing.

Now, there's no way in swish to "update" the index file.  You need to index
all your documents at one time -- so you need enough memory to do so either
with or without -e.  (-e will be slower).

That being said, there are some ways to reduce how often you need to do
that full reindex.  If you want new files to be searchable, one solution is
to keep a directory structure parallel to your real data.  When you add a
new file, make a symlink to it in the parallel directory.  Then swish can
index the parallel directory of symlinks, and you have an incremental index
of just the new files -- something like the sketch below.
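
Here's a rough sketch of the bookkeeping in Perl.  The paths are made up --
adjust them to your own layout:

#!/usr/bin/perl -w
use strict;
use File::Basename;
use File::Path;

# Hypothetical paths -- just for illustration.
my $real_dir = '/www/docs';        # where the documents really live
my $new_dir  = '/www/docs-new';    # parallel tree of symlinks

# Call this whenever a document is added to the real tree.
sub register_new_file {
    my $file = shift;                       # e.g. 'press/2001/release.html'
    mkpath( "$new_dir/" . dirname( $file ) );
    symlink( "$real_dir/$file", "$new_dir/$file" )
        or warn "symlink failed for $file: $!";
}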

Then, when searching, specify both of your indexes with -f.  You can also
merge the index files, but that will take longer.
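
For example, a search across both indexes might look like this (the index
file names are made up):

  swish-e -w 'your query words' -f index.main index.new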

Merging has some advantages over just using -f for searching, but it also
requires a lot of memory for the merge.  In the development version there
has been a lot of work on making swish return results faster when a search
returns *many* results yet you are only asking for, say, a page of twenty.
This only works on merged indexes, not -f indexes.

[Jose had just detailed in an email the differences between merging and using 
-f, but I can't seem to find it now.]

>If anybody can explain the 
>steps to execute the SWISH-E command from a Perl script, I can write the 
>rest of the script and run it from the browser.

There are many ways to do that.

If you don't want to use cron, then this is basically what I'd do (there's
a sketch below) --

1) fork 
2) return your content to the browser
3) become a server: close file handles, setsid
4) run swish.

If you want to prevent running more than one indexing job at a time, then
use a lock file after step 1 so you can report back to the browser whether
indexing started or not.

All this is explained in detail in perldoc perlipc.
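
Here's a rough sketch of those steps in Perl.  The swish command, config
path, and lock file location are all made up, and I take the lock just
before forking so both halves already know whether it succeeded:

#!/usr/bin/perl -w
use strict;
use Fcntl qw( :flock );
use POSIX qw( setsid );

# These paths and the command line are assumptions -- use your own.
my $lock_file = '/tmp/swish-index.lock';
my $index_cmd = '/usr/local/bin/swish-e -c /path/to/swish.conf';

# Try for the lock (non-blocking) so we can report what happened.
open my $lock, '>', $lock_file or die "Can't open $lock_file: $!";
my $got_lock = flock( $lock, LOCK_EX | LOCK_NB );

# 1) fork.
defined( my $pid = fork ) or die "Can't fork: $!";

if ( $pid ) {
    # 2) parent: return your content to the browser, then exit.
    print "Content-type: text/plain\n\n";
    print $got_lock ? "Indexing started.\n" : "Indexing is already running.\n";
    exit;
}

# Child: nothing to do if another indexing run holds the lock.
exit unless $got_lock;

# 3) become a server: close inherited handles and start a new session.
close STDIN;
close STDOUT;
setsid() or die "Can't start a new session: $!";

# 4) run swish.  The child keeps the lock file open until it exits.
my $output = `$index_cmd 2>&1`;
# mail or log $output here if you want a report.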

There are a couple of ways to run the indexing.  I would use a piped
open() call, but frankly, I don't see why you couldn't just:

my $swish_output = `$command_to_index_swish`;
send_mail_to_me( $swish_output );

BUT - DO NOT USE that method if $command_to_index_swish gets ANY data from
a submitted form as it's a security risk (it goes through the shell).

A piped open is better as you have more control, but for simple swish
indexing backticks will probably be fine. 
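
A minimal piped-open sketch (the swish command and arguments are made up,
and on a reasonably recent perl the list form of open also avoids the
shell):

my @output;
open my $swish, '-|', '/usr/local/bin/swish-e', '-c', '/path/to/swish.conf'
    or die "Can't run swish: $!";
while ( my $line = <$swish> ) {
    push @output, $line;      # or log/parse the indexer's progress here
}
close $swish or warn "swish exited with status $?";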

You should also write STDERR to a file or catch $SIG{__WARN__} to return
simple errors.

  perl -we '$SIG{__WARN__} = sub { die "xx:@_:" }; print `sssjsj`'         
  xx:Can't exec "sssjsj": No such file or directory at -e line 1.
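
And the write-STDERR-to-a-file variant is just (the log path is made up):

open STDERR, '>>', '/tmp/swish-index.log'
    or die "Can't redirect STDERR: $!";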


It would be nice if swish could do incremental indexing, but that's a major
design issue.

It would also be nice if you could pass swish a date or a file name and
say only index files modified after (or before) that date.  The problem is
that you would be tempted to compare against the date of the main index
file, but then you could miss files that were added between the time
indexing was started and the time the index file was created.  Maybe swish
should open the new index file before reading in files, so its date marks
the start of the run.




Bill Moseley
mailto:moseley@hank.org
Received on Sat Feb 24 15:59:03 2001