Re: swish-e on a large scale

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 30 2004 - 16:26:32 GMT
On Thu, Sep 30, 2004 at 08:51:35AM -0700, Aaron Levitt wrote:
> I began the indexing approximately 72 hours ago, and it hasn't ended 
> yet.  It is running on a G3 450Mhz machine with  576Mb of RAM.  I can 
> see swish-e hitting my webserver, and the .temp database seems to 
> continue to grow.  I ran the indexer with the following command: 
> ./bin/swish-e -S prog -c swish.conf.

Are you using the -e option?  If not, you have likely run out of RAM,
or at least the hash tables have grown so large that indexing has
slowed way down.  Did you look at free(1), vmstat(8), and other
tools to see how your machine is holding up?
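For example (untested, and adjust paths for your install -- -e tells
swish to use temporary disk files instead of holding everything in
RAM):

    ./bin/swish-e -S prog -c swish.conf -e

and in another terminal keep an eye on memory:

    free -m       # how much RAM is left
    vmstat 5      # big si/so numbers mean you are swapping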

Did you test things out with smaller sets of files first?


> So, I have the following questions:
> 
> 1. I expect to have over 1,000,000 documents in our archives as things 
> progress.  Is this pushing the limits of swish-e?

I'm tempted to say yes, but I know others on the list have indexed
(or are indexing) that many docs.  The basic problem is that swish is
designed to use RAM to be fast -- but Jose has added features like -e,
and also a new btree database back-end (not enabled by default).

> 2. I have seen the indexer hit my robots.txt multiple times, is there a 
> way to check on the progress to see if/when it will finish indexing?

That's interesting -- I wouldn't think it would hit the robots.txt
file more than once.  I'll look at that.

> 3. What should I do regarding the current index process?  I'm afraid to 
> stop it, because I don't want to have to start the indexing all over 
> again.

Well, you can strace the process to see what it's doing.  But even the
spider doesn't know when it will be done until it's actually done.
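For example (substitute the real process id; lots of read()/write()
calls on the .temp files usually means it's still chugging along):

    strace -p <pid of swish-e or spider.pl>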

If you really wanted to stop, you could sighup the spider and just
index what has been fetched so far.  Then use swish-e to dump out all
the indexed paths to a dbm file, spider again but reject any URLs
found in the dbm file, and finally merge the two indexes.
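Something like this sketch (untested -- I think -T index_files will
dump the paths, but check the -T docs, and the file names here are
just examples):

    kill -HUP <spider pid>     # spider exits cleanly, swish finishes

    # dump what made it into the first index:
    ./bin/swish-e -f index.swish-e -T index_files > done.txt

    # have spider.pl's test_url callback reject URLs listed in
    # done.txt on the next run, writing index2.swish-e, then:
    ./bin/swish-e -M index.swish-e index2.swish-e merged.swish-e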

> 4. Do you have any recommendations on what I can do to improve this 
> process?

A few ideas:

In general, you can spider separately from indexing.  Just capture the
output from the spider to a file and when done pipe that into swish.
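Something like this (the spider config name is just an example, and I
believe -i stdin tells swish to read the prog output from standard
input -- check the docs):

    ./prog-bin/spider.pl spider.config > spider.out
    ./bin/swish-e -c swish.conf -S prog -i stdin < spider.out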

Again, you can sighup the spider to have it stop processing, then
grep out the URLs it already fetched and skip those when you start
spidering again.  You will still end up fetching a lot of files you
don't really need (I'm working on a patch to spider.pl right now to
use HEAD requests, which would speed this up).
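In the prog output each document starts with a Path-Name: header, so
assuming you saved the spider output to a file as above:

    grep '^Path-Name:' spider.out | awk '{print $2}' > done.txt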

You might be able to index the raw email messages faster than
spidering the mail archive.
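If the raw messages are already on disk you can skip HTTP entirely --
for example, with the file system method (the path here is just a
placeholder):

    ./bin/swish-e -c swish.conf -S fs -i /path/to/mail/archives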

Since you are indexing a mail archive (where old messages don't
change), you should try building swish with the --enable-incremental
option.  Then you can *add* files to the index as needed.  It still
requires some of the normal processing (like presorting all the
records) but should be faster than reindexing.
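That is (untested -- the incremental code is experimental, and I
believe that build adds a -u switch for updating an existing index,
but check the docs):

    ./configure --enable-incremental && make && make install

    # later, add just the new messages:
    ./bin/swish-e -c swish.conf -S prog -u -f index.swish-e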

You might break your archives up by year and create separate indexes.
Then you can either search multiple indexes or merge them as needed.
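For example (the index names are made up):

    # search two yearly indexes at once:
    ./bin/swish-e -w foo -f index.2003 index.2004

    # or merge them into one:
    ./bin/swish-e -M index.2003 index.2004 index.all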

That help at all?


-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu