
Re: [SWISH-E:177] Re: Fw: Re: More?

From: Simon Wilkinson <sxw(at)not-real.tardis.ed.ac.uk>
Date: Thu Mar 05 1998 - 17:31:42 GMT
> 	The reason is simple: depth first would take an extremely long
> 	time to complete and most likely blow your stack since, every
> 	time you come to a link, you'd follow it to another page and
> 	possibly to another site.  You'd never get around to finishing
> 	the site you started at until months later.  Breadth-first
> 	would complete an entire page (and therefore an entire site)
> 	at a time before moving on.

It depends on whether you're doing the search over just a single site or
across multiple sites. For single-site searching, depth-first can often be
the best choice, as the search will "bottom out" fairly rapidly and your
working list will stay smaller than it would with breadth-first. Harvest,
in particular, has large memory problems due to the growth of the working
list when used in breadth-first mode.

For multiple (or unrestricted) site searches, a breadth-first approach
gives a better "snapshot" of a site more quickly, as you'll index all of
the top-level pages before moving on down the tree.
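To make the difference concrete, here's a minimal sketch (Python, over a
made-up toy link graph; purely illustrative, not code from Harvest or any
real crawler). The only difference between the two traversals is whether
the working list behaves as a queue or as a stack:

    from collections import deque

    # Toy link graph: page -> pages it links to (purely illustrative).
    LINKS = {
        "/":    ["/a", "/b"],
        "/a":   ["/a/1", "/a/2"],
        "/b":   ["/b/1"],
        "/a/1": [], "/a/2": [], "/b/1": [],
    }

    def crawl(start, breadth_first):
        frontier = deque([start])   # the "working list"
        seen = {start}
        order = []
        while frontier:
            # FIFO gives breadth-first; LIFO gives depth-first.
            page = frontier.popleft() if breadth_first else frontier.pop()
            order.append(page)
            for link in LINKS[page]:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return order

    print("breadth-first:", crawl("/", True))    # top-level pages first
    print("depth-first: ", crawl("/", False))    # one branch to the bottom

On a deep site the depth-first frontier holds at most one root-to-leaf
path plus its pending siblings, while the breadth-first frontier can hold
an entire level of the tree at once, which is exactly the working list
growth mentioned above.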

However, when searching multiple sites a large number of other factors
come into play. In order to be a friendly robot you don't want to sit
hammering away at one site in quick succession, so you pause for, say,
five minutes between fetches from that site. During the time you are
paused, however, you can be fetching from other sites. Similarly, if
one site is being exceptionally slow (or is down), then you don't want
to keep retrying that site with the same frequency as you access other,
awake, sites.
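To sketch what that scheduling might look like (hypothetical Python; the
five-minute delay, the backoff factor and all of the names here are my
own illustrative assumptions, not any real crawler's API), you can key a
priority queue on the time each host is next allowed to be fetched:

    import heapq
    import time

    POLITE_DELAY = 300.0   # seconds between fetches from one host (assumed)
    BACKOFF = 2.0          # grow a host's delay when it is slow or down

    def schedule(urls, fetch):
        delay = {}     # host -> current inter-fetch delay
        pending = {}   # host -> list of URLs still to fetch
        heap = []      # (next allowed fetch time, host)
        for url in urls:
            host = url.split("/")[2]   # crude host extraction, sketch only
            if host not in pending:
                pending[host] = []
                delay[host] = POLITE_DELAY
                heapq.heappush(heap, (time.time(), host))
            pending[host].append(url)
        while heap:
            ready_at, host = heapq.heappop(heap)
            time.sleep(max(0.0, ready_at - time.time()))  # only if none ready
            if not fetch(pending[host].pop(0)):
                delay[host] *= BACKOFF    # retry slow/dead hosts less often
            if pending[host]:
                heapq.heappush(heap, (time.time() + delay[host], host))

The point is just that the fetch order ends up being driven by per-host
timers rather than by the shape of the link graph, so neither "depth
first" nor "breadth first" really describes what the crawler does.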

Therefore, a multiple-site crawler needs to have a fairly complex search
scheduling algorithm, certainly more complex than simply adopting a
breadth-first or depth-first approach ...

Cheers,

Simon.
Received on Thu Mar 5 09:43:29 1998