Will swish-e index *very* large sites?

From: Ace <aceop(at)>
Date: Mon Jul 29 2002 - 10:24:18 GMT
Hi there!

At the moment I use htdig to index the site of a big university in 
Germany (in fact, the University of Erlangen-Nuremberg, which has about 21,000 students). I'm very 
happy with htdig's ability to index parts of the site. But when it comes 
to the overall index, which should include all sites hosted by the 
university's computing center, every (stable) version of htdig I know 
either crashes (in fact it might not crash outright, but after 
2 days of it consuming 100% CPU time without any visible progress 
I find it crashed) or runs into 2 GB file-size-limit 
problems (at least on Linux, which is the platform the search engine 
must run on). These are not caused by kernel or filesystem limitations 
(kernel 2.4.18 and the ext3 filesystem are both capable of handling 
files of several TB). So what I need is a search engine that will index 
.doc, .pdf, .ps and all kinds of HTML and text, that can also deal with 
umlauts, that doesn't crash when the amount of data to be indexed is a 
bit bigger than usual, and that will return search results within a 
reasonable time even though the database might be several GB in size.

The available hardware is an HP server with 2 Pentium III Xeon 1 GHz 
CPUs, 1 GB memory and 100 GB SCSI RAID-30 disk space. The server has 
nothing else to do but host the search engine (and its web server).
Maybe you can give me a hint whether I should try swish-e, whether I can 
make use of both CPUs, whether swish-e has incremental indexing... and so on. I 
have no problem using a bleeding-edge development version as long as 
this version is not capable of breaking out of a chroot (so no matter 
what the version is doing it won't harm the rest of the installation).

Peter Asemann
(almost) desperate part-time search-engine administrator looking for 
something that works.
