On Thu, 5 Mar 1998, Simon Wilkinson wrote:
> It depends on whether you're doing the search over just a single site,
For a single site, it's the same as indexing locally, i.e., in
that case you're not really indexing the "web."
> or across multiple sites. For single site searching depth first can often be
> the best choice, as the search will "bottom out" fairly rapidly and your
> working list will stay smaller than that with breadth first.
Unless, of course, you have deeply-nested directories and you
exhaust the maximum number of open file descriptors allowed on
your Unix implementation.
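
To illustrate the descriptor concern, here is a minimal sketch of a depth-first walk over a local directory tree. It assumes the descriptors at risk are directory handles that a recursive walk keeps open at every level of nesting; the function name walk_depth_first is hypothetical. With an explicit stack, each directory is closed before the walk descends, so only one handle is open at a time regardless of depth.

```python
import os

def walk_depth_first(root):
    """Yield file paths depth first, with at most one directory handle open."""
    stack = [root]
    while stack:
        directory = stack.pop()
        # Read the whole listing, then let the with-block close the handle
        # before descending into any subdirectory.
        with os.scandir(directory) as entries:
            children = sorted(entries, key=lambda entry: entry.name)
        for entry in children:
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)  # visit later; nothing stays open
            else:
                yield entry.path
```

The trade-off is that the pending subdirectories live in memory on the stack instead of in open handles, which is usually the cheaper resource.
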
> For multiple (or unrestricted) site searches - a breadth first approach
> gives a better "snapshot" of a site quicker (as you'll index all of the
> top level pages before moving on down the tree).
Exactly.
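
The difference between the two orders can be sketched on a toy site. Everything here is an assumption for illustration: the SITE link map is made up, and crawl is a hypothetical helper that uses one working list as a queue (breadth first) or as a stack (depth first on this tree-shaped example).

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/c": [],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}

def crawl(start, breadth_first):
    """Return (visit order, peak working-list size) for one traversal."""
    frontier = deque([start])
    seen = {start}
    order = []
    peak = 1
    while frontier:
        # Queue behaviour gives breadth first; stack behaviour gives depth first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in SITE[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        peak = max(peak, len(frontier))
    return order, peak

bfs_order, bfs_peak = crawl("/", breadth_first=True)
dfs_order, dfs_peak = crawl("/", breadth_first=False)
```

On this example the breadth-first order visits "/", "/a", "/b" and "/c" before anything deeper, which is the "snapshot" effect, while the depth-first run keeps the smaller working list (a peak of 3 entries against 4).
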
> However, when searching multiple sites a large number of other factors
> come into play. In order to be a friendly robot you don't want to be
> sitting hammering away on one site in quick succession - so you pause
> for, say 5 minutes, between fetches from that site. However during the
> time you are paused you can be fetching from other sites.
True, but that has nothing to do with BF vs. DF as you can
hammer a site equally in either case.
> Therefore, a multiple site crawler needs to have a fairly complex search
> scheduling algorithm, certainly more complex than adopting simply a breadth
> or depth first approach ...
And I never said that wasn't the case. I merely said, "better
suited"; I didn't say "ideal."
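
For what the scheduling side might look like, here is a rough sketch and not anyone's actual crawler. It assumes a fixed per-host delay (the 5 minutes mentioned above, as 300 seconds) and a caller-supplied clock; the class name PoliteScheduler and its methods are hypothetical. Each host gets its own queue, and a heap orders hosts by the earliest time they may be fetched again, so the crawler can work on other sites while one is paused.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class PoliteScheduler:
    """Hand out URLs so that no host is fetched more often than every `delay` seconds."""

    def __init__(self, delay=300.0):
        self.delay = delay
        self.queues = {}    # host -> deque of pending URLs
        self.next_ok = {}   # host -> earliest time of the next allowed fetch
        self.ready = []     # heap of (earliest allowed fetch time, host)

    def add(self, url, now):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            # A host seen before keeps its pause even if its queue had emptied.
            allowed = max(now, self.next_ok.get(host, now))
            heapq.heappush(self.ready, (allowed, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, or None if all are paused."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_ok[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return url
```

Within each host the queue here is first-in, first-out, but it could just as well be a stack; the pause is enforced per host either way, which is the point that politeness is independent of BF vs. DF.
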
- Paul J. Lucas
NASA Ames Research Center      Caelum Research Corporation
Moffett Field, California      San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Thu Mar 5 15:35:35 1998