On Sun, Dec 07, 2003 at 04:10:17AM -0800, Dave Stevens wrote:
> >> This machine is a single Athlon XP 1800+ with an inexpensive Asus K7
> >> board and only 512 MB of RAM.
> >
> > I assume you are using -e when indexing.
>
> Actually no. I was more or less looking at the limitations using Swish-e
> knowing it wasn't really intended to handle somewhat massive data sets
> and that my resources (particularly RAM) might be taxed. I'll try the -e
> switch on the next crawl and see if it helps. I'm also adding another
> 512MB of RAM to see what that does. When I SIGHUPed the last crawl at 96
> hours, the pages were being returned at the rate of two or fewer a minute,
> much slower than any of the other crawls.

Of course, RAM is faster than disk. But not if you use it all up. -e
helps by flushing to disk as it's indexing (after each document). But
past some point -e may be faster than indexing entirely in RAM, regardless
of how much RAM you have, because the hash tables get so big.

The word index, for example, is a table indexed by the hash value of the
word. But there can be lots of words with that same hash value. So,
once swish-e finds the hash index it does a sequential walk to find the
right word/file/metaname/position to add the word to the index.
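
To make the cost of that sequential walk concrete, here is a minimal sketch
of a chained hash table in Python. It is illustrative only and is not
swish-e's actual C data structure: the WordIndex class, the bucket_for
function, the tiny bucket count and the hash function are all assumptions
made up for this example.

```python
# Illustrative only: a chained hash table, not swish-e's real index code.
NUM_BUCKETS = 8  # assumed, and deliberately tiny so collisions are obvious


def bucket_for(word, num_buckets=NUM_BUCKETS):
    """Map a word to a bucket using a simple deterministic hash."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % num_buckets
    return h


class WordIndex:
    """Hypothetical word index: each bucket is a list (chain) of entries."""

    def __init__(self, num_buckets=NUM_BUCKETS):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.steps = 0  # chain entries examined across all add() calls

    def add(self, word, file_id, position):
        chain = self.buckets[bucket_for(word, self.num_buckets)]
        # Sequential walk: compare against every entry sharing this hash value.
        for entry in chain:
            self.steps += 1
            if entry[0] == word:
                entry[1].append((file_id, position))
                return
        # Word not seen before: append a new entry to the end of the chain.
        chain.append((word, [(file_id, position)]))


index = WordIndex()
for i in range(1000):
    index.add("word%d" % i, file_id=i, position=0)

longest = max(len(chain) for chain in index.buckets)
print("longest chain:", longest, "entries walked:", index.steps)
```

With far more distinct words than buckets, every add has to walk a long
chain, so the total work grows much faster than the number of words.
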
Jose has added a btree backend, although I'm not sure it helps the
problem of full hash tables.

Swish-e also maintains some simple tables with one entry per indexed
document (table[num_documents]), so those will eat RAM and take time to
load off disk for every search. They are very helpful for speed, but can
become an issue for very large document sets.
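
As a rough back-of-the-envelope sketch of how such per-document tables
scale, the Python below just multiplies it out. The entry size (4 bytes)
and the number of tables (3) are assumptions picked for illustration, not
figures taken from swish-e, and table_bytes is a hypothetical helper.

```python
from array import array


def table_bytes(num_documents, bytes_per_entry, num_tables=1):
    """Bytes needed for num_tables flat tables with one entry per document."""
    return num_documents * bytes_per_entry * num_tables


# A small concrete table: one signed integer slot per document.
num_documents = 100_000
table = array("i", [0]) * num_documents
assert len(table) * table.itemsize == table_bytes(num_documents, table.itemsize)

# Assumed sizes: 4-byte entries and 3 tables, purely for illustration.
for n in (100_000, 1_000_000, 10_000_000):
    mb = table_bytes(n, bytes_per_entry=4, num_tables=3) / (1024 * 1024)
    print("%10d documents: about %.1f MB" % (n, mb))
```

The growth is only linear, but the whole table has to be read in before
each search, so the load time grows right along with the document count.
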
> Another issue I could see was that after the spider had been crawling for
> several hours, unless Apache was restarted a couple of times a day, PHP
> performance would tank to the point where PHP/mysql pages would take
> upwards of a minute to serve, though html and cgi/DBI/mysql pages loaded a
> little slower than normal, still acceptable. Just don't have enough box
> at the moment.

Sounds like a swapping problem if it happens after indexing for a while.

Interesting stuff about Lucene and Nutch. Doug Cutting spoke in
Berkeley a few months back about search engines and I was sad I couldn't
make it that day.
Here's a place to try nutch:
http://research.overture.com/demo/nutch/
Although about 1/2 the time I get:
Error 404: File Not Found
The page '/error.html' could not be found. Please check that you did not mistype the URL. If you followed a link to this page, we appologize for the error.
oops. ;)
Plus if you search for "Perl" Matt's Script Archive is the first hit.
Clearly broken... ;)
--
Bill Moseley
moseley@hank.org
Received on Sun Dec 7 14:38:55 2003