Re: Lucene/Nutch (WAS: converting .temp indices...)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Dec 07 2003 - 14:38:52 GMT

On Sun, Dec 07, 2003 at 04:10:17AM -0800, Dave Stevens wrote:
> >> This machine is a single Athlon XP 1800+ with an inexpensive Asus K7
> >> board and only 512 MB of RAM.
> >
> > I assume you are using -e when indexing.
> 
> Actually no.  I was more or less looking at the limitations using Swish-e
> knowing it wasn't really intended to handle somewhat massive data sets
> and that my resources (particularly RAM) might be taxed.  I'll try the -e
> switch on the next crawl and see if it helps.  I'm adding another 512MB of
> RAM and see what that does.  When I SIGHUPed the last crawl at 96 hours,
> the pages were being returned at the rate of two or less a minute, much
> slower than any of the other crawls.

Of course, RAM is faster than disk.  But not if you use it all up.  -e 
helps by flushing to disk as it's indexing (after each document).  But, 
past some point -e may be faster than indexing entirely in RAM, 
regardless of how much RAM you have, because the in-memory hash tables 
get so big.

The word index, for example, is a table indexed by the hash value of the 
word.  But there can be lots of words with that same hash value.  So, 
once swish-e finds the hash index it does a sequential walk to find the 
right word/file/metaname/position to add the word to the index.
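
To make the cost of that sequential walk concrete, here is a rough sketch in Python of a fixed-size hash table with collision chains. It is not swish-e's actual code: the class name, the hash function, and the bucket count are all made up for illustration. The point is only that when the number of buckets stays fixed and the vocabulary keeps growing, every add has to walk a longer chain.

```python
class ChainedWordIndex:
    """Toy fixed-size hash table with collision chains (illustrative only)."""

    def __init__(self, num_buckets=8):
        # The bucket count never grows, so chains lengthen as words are added.
        self.buckets = [[] for _ in range(num_buckets)]
        self.steps = 0  # total chain entries examined during add()

    def _chain(self, word):
        # Hypothetical hash: sum of the word's bytes modulo the table size.
        return self.buckets[sum(word.encode()) % len(self.buckets)]

    def add(self, word, file_id, position):
        chain = self._chain(word)
        for entry in chain:  # sequential walk along the chain
            self.steps += 1
            if entry[0] == word:
                entry[1].append((file_id, position))
                return
        # Word not seen before: start a new entry at the end of the chain.
        chain.append((word, [(file_id, position)]))

    def lookup(self, word):
        for entry in self._chain(word):
            if entry[0] == word:
                return entry[1]
        return []
```

With this toy hash, "ab" and "ba" land in the same bucket, so adding one after the other already forces a walk past the first entry. Scale that up to millions of distinct words and the walks, not the RAM, can become the bottleneck.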

Jose has added a btree backend, although I'm not sure it helps the 
problem of full hash tables.

Swish-e also maintains some simple tables the size of the number of 
documents indexed (table[num_documents]), so those will eat RAM and 
require time to load off disk for every search.  They are very helpful 
for speed, but can become an issue for very large document sets.
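
As a back-of-the-envelope illustration of why those per-document tables matter, the sketch below builds one flat table with a slot per document and estimates its size. The function names and the choice of an unsigned-int element type are assumptions for the example, not swish-e's real layout; the only claim is that the cost grows linearly with the number of documents and is paid by every search that loads the table.

```python
from array import array


def build_document_table(num_documents, typecode="I"):
    """Return a zero-filled table with one slot per document (hypothetical layout)."""
    return array(typecode, [0]) * num_documents


def document_table_bytes(num_documents, typecode="I"):
    """Estimate the memory needed for one such table, in bytes."""
    return array(typecode).itemsize * num_documents
```

For example, document_table_bytes(5_000_000) gives the size of a single table for five million documents; multiply by the number of such tables to get a feel for the fixed overhead per search.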

> Another issue I could see was that after the spider had been crawling for
> several hours, unless Apache was restarted a couple of times a day, PHP
> performance would tank to the point where PHP/mysql pages would take
> upwards of a minute to serve, though html and cgi/DBI/mysql pages loaded a
> little slower than normal, still acceptable.  Just don't have enough box
> at the moment.

Sounds like a swapping problem if it happens after indexing for a while.

Interesting stuff about Lucene and Nutch.  Doug Cutting spoke in 
Berkeley a few months back about search engines and I was sad I couldn't 
make it that day.

Here's a place to try nutch:

  http://research.overture.com/demo/nutch/

Although about 1/2 the time I get:

Error 404: File Not Found
The page '/error.html' could not be found. Please check that you did not mistype the URL. If you followed a link to this page, we appologize for the error.

oops. ;)

Plus if you search for "Perl" Matt's Script Archive is the first hit.  
Clearly broken... ;)


-- 
Bill Moseley
moseley@hank.org
Received on Sun Dec 7 14:38:55 2003