
Re: [SWISH-E:177] Re: Fw: Re: More?

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Thu Mar 05 1998 - 18:40:39 GMT
On Thu, 5 Mar 1998, Simon Wilkinson wrote:

> It depends on whether you're doing the search over just a single site,

	For a single site, it's the same as indexing locally, i.e., in
	that case you're not really indexing the "web."

> or across multiple sites.  For single-site searching, depth-first can often
> be the best choice, as the search will "bottom out" fairly rapidly and your
> working list will stay smaller than it would with breadth-first.

	Unless, of course, you have deeply-nested directories and you
	exhaust the maximum number of open file descriptors allowed on
	your Unix implementation.
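	The descriptor hazard comes from recursive traversals that hold a
	directory handle open at every level of the nesting. A minimal sketch
	(Python, purely illustrative; the indexers in this thread were C
	programs) of a depth-first walk that sidesteps it by reading each
	directory completely before descending, so only path strings sit on
	the stack:

```python
import os

def dfs_walk(root):
    """Depth-first traversal holding at most one directory handle open.

    Each directory listing is read in full (and its descriptor closed)
    before we descend, so deep nesting never translates into a pile of
    simultaneously open file descriptors.
    """
    stack = [root]
    files = []
    while stack:
        path = stack.pop()
        if os.path.isdir(path):
            # os.listdir reads the whole directory, then closes it.
            # reverse=True so pops come back in alphabetical order.
            for name in sorted(os.listdir(path), reverse=True):
                stack.append(os.path.join(path, name))
        else:
            files.append(path)
    return files
```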

> For multiple (or unrestricted) site searches - a breadth-first approach
> gives a better "snapshot" of a site more quickly (as you'll index all of
> the top-level pages before moving on down the tree).

	Exactly.

> However, when searching multiple sites a large number of other factors
> come into play.  In order to be a friendly robot, you don't want to be
> sitting hammering away at one site in quick succession - so you pause
> for, say, 5 minutes between fetches from that site.  However, during the
> time you are paused you can be fetching from other sites.

	True, but that has nothing to do with BF vs. DF, as you can
	hammer a site equally hard in either case.
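	The per-host pause is indeed orthogonal to traversal order: either
	frontier (stack or queue) can be wrapped in a scheduler that tracks,
	per host, the earliest time the next fetch is permitted. A minimal
	sketch (the 5-minute delay is Simon's quoted figure; no real network
	I/O is performed):

```python
import heapq
from urllib.parse import urlparse

DELAY = 300.0  # seconds between fetches to one host (Simon's 5 minutes)

class PoliteScheduler:
    """Release URLs no faster than one per DELAY seconds per host.

    Fetches to other hosts interleave freely during a host's
    cooling-off period, regardless of BF or DF frontier order.
    """
    def __init__(self):
        self.ready = []      # heap of (earliest_fetch_time, seq, url)
        self.next_ok = {}    # host -> earliest permitted fetch time
        self.seq = 0         # tie-breaker preserving insertion order

    def add(self, url, now):
        host = urlparse(url).netloc
        t = max(now, self.next_ok.get(host, now))
        self.next_ok[host] = t + DELAY
        heapq.heappush(self.ready, (t, self.seq, url))
        self.seq += 1

    def pop(self):
        """Return (fetch_time, url) for the next permissible fetch."""
        t, _, url = heapq.heappop(self.ready)
        return t, url
```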

> Therefore, a multiple-site crawler needs to have a fairly complex search-
> scheduling algorithm, certainly more complex than simply adopting a
> breadth-first or depth-first approach ...

	And I never said that wasn't the case.  I merely said, "better
	suited"; I didn't say "ideal."

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Thu Mar 5 15:35:35 1998