Re: swish-e on a large scale

From: Peter Karman <karman(at)>
Date: Thu Sep 30 2004 - 16:11:20 GMT
Hi Aaron,

Glad to see Apple is joining the swish-e ranks. :)

Aaron Levitt wrote on 09/30/2004 10:53 AM:
> I ran the indexer with the following command:
> ./bin/swish-e -S prog -c swish.conf.

Can you send along the contents of swish.conf? That might be helpful.

> So, I have the following questions:
> 1. I expect to have over 1,000,000 documents in our archives as things 
> progress.  Is this pushing the limits of swish-e?

I think there are folks on this list doing in excess of a million docs, 
but perhaps in smaller groups, depending on how often they need to be 
reindexed. One thing I like about swish-e is the ability to search 
multiple indexes simultaneously.

So yes, I think you can do a million, but for admin purposes, you might 
want to identify subsets and split them into smaller indexes.
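
Searching several indexes at once is just a matter of listing them all 
after -f. A rough sketch, assuming two made-up index files under 
indexes/; the SWISH variable and the search_all name are my own 
inventions for the example, not anything swish-e ships with:

```shell
#!/bin/sh
# Name of the swish-e binary; override SWISH if it lives elsewhere.
SWISH=${SWISH:-swish-e}

# Hypothetical index files, one per subset of the archive.
INDEXES="indexes/lists-2003.index indexes/lists-2004.index"

# Search every index in $INDEXES for the query words given as arguments.
# -w introduces the query; -f takes one or more index files when searching.
search_all() {
    # $INDEXES is unquoted on purpose so each path becomes its own argument.
    "$SWISH" -w "$@" -f $INDEXES
}

# Example: search_all apple
```

Dropping an index from INDEXES narrows the search to the remaining 
subsets without touching the indexes themselves.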

> 2. I have seen the indexer hit my robots.txt multiple times, is there a 
> way to check on the progress to see if/when it will finish indexing?

Bill will likely have a better idea than me.

> 3. What should I do regarding the current index process?  I'm afraid to 
> stop it, because I don't want to have to start the indexing all over 
> again.

Hmm. I'd let it go just for curiosity's sake. But I understand your 
concern. Is there a way you could benchmark the index size via the -S fs 
method, so you know what you're aiming for? I'm just wondering if you 
could identify whether the bottleneck is the spider or really is the 
indexer.

> 4. Do you have any recommendations on what I can do to improve this 
> process?

Like I said above, splitting up the docs into subsets depending on how 
often they need to be indexed can be helpful. It's also a nice way to 
limit the scope of a search, just by selecting which indexes are 
searched. That way you needn't futz with special metanames, etc.
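
To make the subset idea concrete, here is a minimal sketch of rebuilding 
one subset at a time. It assumes one config file per subset; the 
conf/<name>.conf and indexes/<name>.index layout, the SWISH variable and 
the reindex_subset name are all hypothetical, so adjust to taste:

```shell
#!/bin/sh
# Name of the swish-e binary; override SWISH if it lives elsewhere.
SWISH=${SWISH:-swish-e}

# Rebuild the index for a single subset from its own config file.
# -S prog matches the spider-based run quoted above; -c names the config
# and -f names the index file to write.
reindex_subset() {
    name=$1
    "$SWISH" -S prog -c "conf/$name.conf" -f "indexes/$name.index"
}

# Example: only the current year's subset needs a frequent rebuild.
# reindex_subset lists-2004
```

Older subsets that never change can then be indexed once and left alone, 
while the busy ones get rebuilt from cron as often as needed.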

