
RE: swish-e on a large scale

From: Aaron Bazar <aaronb(at)not-real.spamcop.net>
Date: Thu Sep 30 2004 - 16:27:15 GMT
Hi!

I am a happy user of swish-e. I am not an expert, however...

I have well over 1 million docs in one index, and it works fine. However, I
would never attempt to index them through the web server/spider; it would be
painfully slow. Is there any way you can do it without having to spider the
web server? If you can index directly from the file system, 1 million
documents might take less than an hour. There is also a way to index your
back-end database if your site is dynamic; a MySQL example is included as
part of Swish-e, and this is also very fast.
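For example, a file-system run looks something like this (the binary
location and archive path here are just assumptions; point them at wherever
swish-e and your documents actually live):

    # index straight from the file system instead of spidering
    /usr/local/bin/swish-e -S fs -c swish.conf -i /var/www/archives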

In direct answer to your question about how much longer it has to run: you
could grep your web log and count how many files the spider has grabbed so
far. That should give you a good idea of how many have been indexed. If you
want to stop the process without toasting your index, kill the perl spider
process, not swish-e (ps -ef | grep spider.pl, then kill the perl process
running spider.pl; if you kill swish-e itself, you will lose everything).
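Something along these lines, assuming an Apache-style access log and that
the spider's requests can be picked out by a user-agent string (both are
assumptions; adjust to your setup):

    # rough count of pages the spider has fetched so far
    grep -c 'swish-e spider' /var/log/httpd/access_log

    # stop the spider cleanly: kill the perl spider.pl process, NOT
    # swish-e, so swish-e can still finish writing what it has
    ps -ef | grep spider.pl
    kill <pid of the perl process running spider.pl>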



Best regards,

Aaron

http://www.vvx.net/


-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Aaron Levitt
Sent: Thursday, September 30, 2004 10:54 AM
To: Multiple recipients of list
Subject: [SWISH-E] swish-e on a large scale


Hello folks-

I am working on a project to index all of our mailing list archives.
Currently we have over 600,000 documents to be indexed so they can be
searched. I have swish-e 2.4.2 installed and running. I am spidering with
the included spider.pl application so that indexing doesn't affect resources
on the production machine as much.

I began the indexing approximately 72 hours ago, and it hasn't ended
yet. It is running on a 450 MHz G3 machine with 576 MB of RAM. I can
see swish-e hitting my webserver, and the .temp database seems to keep
growing. I ran the indexer with the following command:
/bin/swish-e -S prog -c swish.conf.

So, I have the following questions:

1. I expect to have over 1,000,000 documents in our archives as things
progress.  Is this pushing the limits of swish-e?

2. I have seen the indexer hit my robots.txt multiple times. Is there a
way to check on the progress to see if/when it will finish indexing?

3. What should I do regarding the current index process?  I'm afraid to
stop it, because I don't want to have to start the indexing all over
again.

4. Do you have any recommendations on what I can do to improve this
process?

Any help would be greatly appreciated.

-=Aaron
Administrator, lists.apple.com