
Re: Merging vs. Spider

From: Bill Moseley <moseley(at)>
Date: Thu Oct 03 2002 - 16:51:28 GMT
At 09:04 AM 10/03/02 -0700, David VanHook wrote:
>So, I'm confronted with a couple of possible solutions, both of which have
>potential problems:
>1) Generate a FULL index of the entire site, all 20,000 pages.  Store that
>as the MAIN index.  Then run daily incremental indexes, using a timestamp,
>and merge each of these incremental updates with the main index, always
>keeping a copy of the main index as backup.  As long as we don't flush
>cache, the incremental updates will only contain files created since the
>main index, so there will be no overlap.
>Potential problem:  If we do a flush cache on the whole site, the daily
>incremental update is going to contain thousands of files -- both new files
>AND new versions of files already in the main index.  When we do a merge,
>will all of these items show up twice on search results, since they are in
>both indexes?  They'd have the same filenames, and otherwise be pretty much

Merge compares the path names, and when duplicates are found only the one
with the newest date is kept.  I'm not a huge fan of merge -- it works much
better now (thank you Jose!), but I'd rather keep a local cache of updated
files and index those in one shot.
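To make the merge rule above concrete, here is a toy Python sketch of the "newest date wins" behavior -- purely an illustration of the semantics, not swish-e's actual merge code, and all names in it are invented:

```python
# Toy sketch of the merge rule described above: when the same path
# appears in both indexes, keep only the entry with the newest date.
# (Illustration only -- not swish-e's actual merge implementation.)

def merge_indexes(main, incremental):
    """Each index maps path -> (mtime, data); the newest mtime wins."""
    merged = dict(main)
    for path, (mtime, data) in incremental.items():
        if path not in merged or mtime > merged[path][0]:
            merged[path] = (mtime, data)
    return merged

main = {"/a.html": (100, "old a"), "/b.html": (100, "b")}
incr = {"/a.html": (200, "new a"), "/c.html": (150, "c")}
merged = merge_indexes(main, incr)
# /a.html appears once, as the newer version -- no duplicate results.
```

So even after a full cache flush, duplicated paths collapse to a single entry; the cost is that merge must compare every path across both indexes.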

>2) Run the spider on the entire site on a nightly basis, which would
>force our server to re-generate any pages deleted via a flush cache.

This would probably be my suggestion -- spider the caching server. That
will let the caching server do its work of deciding what pages to return
from the cache vs. fetch from the server.

>Potential problem:  How long would it take to index 20,000 files?
>Right now, using the filesystem, it takes about 25 minutes to run both our
>full indexes -- we've got two separate ones, one for Fuzzy searching, and
>one regular.

25 minutes seems a little long for indexing 20,000 static files, but
perhaps they are large -- it takes about 15 minutes to index
somewhere around 50,000 files, IIRC.

>Would the spider take 2 or 3 times that, or would it take 20
>or 30 times that?

Make sure you have a current installation of the LWP Perl libraries, and use
the keep_alive feature if your server supports it -- that will save on
connection time and on the number of processes needed to handle the
requests (say, if you are running a forking server).
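The point of keep_alive is reusing one TCP connection for many requests instead of reconnecting for every page. Here is a small self-contained Python sketch of the same idea (the spider itself uses Perl's LWP; this local toy server and its URLs are invented for the demonstration):

```python
# Sketch of why keep-alive helps when spidering: fetch several pages
# over ONE persistent HTTP/1.1 connection instead of reconnecting.
# Standard library only; the local server exists just for the demo.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"      # enables persistent connections

    def do_GET(self):
        body = f"page {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):      # keep the demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
pages = []
for path in ("/a", "/b", "/c"):       # three requests, one connection
    conn.request("GET", path)
    pages.append(conn.getresponse().read().decode())
conn.close()
server.shutdown()
```

Each extra request skips the TCP handshake, and a forking server only has to spawn one child for the whole crawl session rather than one per page.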

Are you using a caching proxy?  Do you always need to flush the cache when
you make an update to the site?  Or could you let the normal cache control
headers control when the proxy updates its cache?

The basic problem is the docs are dynamically generated so you have to
fetch them from a web server at some point.  But I suppose there could be
some optimizations:

For example, if your web space (i.e. cache) mirrors the file system, then
use the file system for "spidering", and when you try to follow a link that
doesn't exist on disk you ask the web server for it.  The spider could do
this with minor changes: in test_url() you first rewrite the URL into a
path and check for the file on disk; if it's there, use that, and if not,
let the spider fetch it from the server.
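That filesystem-first check might look something like the Python sketch below. It is only an outline of the idea -- in the spider the change would live in the Perl test_url() callback, and the names here (cache_root, fetch, get_document) are invented for illustration:

```python
# Sketch of the filesystem-first idea: rewrite the URL into a path
# under the cache directory, use the file if it exists, and fall back
# to an HTTP fetch otherwise.  All names here are hypothetical.
import os
import tempfile
from urllib.parse import urlparse

def get_document(url, cache_root, fetch):
    """Return page content, preferring the on-disk cache mirror."""
    path = os.path.join(cache_root, urlparse(url).path.lstrip("/"))
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as f:
            return f.read()            # cache hit: no HTTP request made
    return fetch(url)                  # cache miss: ask the web server

# Tiny demonstration using a stub fetcher instead of a live server.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "docs"))
with open(os.path.join(root, "docs", "a.html"), "w") as f:
    f.write("cached page")

fetched = []
def fake_fetch(url):
    fetched.append(url)                # record which URLs hit the server
    return "fresh page"

hit = get_document("http://example.com/docs/a.html", root, fake_fetch)
miss = get_document("http://example.com/docs/b.html", root, fake_fetch)
```

With 20,000 mostly-cached pages, nearly all "fetches" become local file reads, and only the pages missing from the cache generate real HTTP traffic.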

Sorry, I'm not offering much help.  Please report back with whatever you
come up with, OK?

Bill Moseley
Received on Thu Oct 3 16:58:35 2002