
Re: [SWISH-E:181] Re: More

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Fri Mar 06 1998 - 17:33:28 GMT
On Thu, 5 Mar 1998 00prneubauer@bsuvc.bsu.edu wrote:

> Suppose we select an arbitrary page as our root -- call it "Home". 
> Home has links to A, B, C and D, so we record those links and then
> follow them in sequence.  We find that A has links to i, ii, iii and
> iv, that B has links to v, vi, vii and viii, etc.  Page i then has
> links to a, b, c and ii links to d, e, f and g, etc.  But some of
> those links (assume iv) may be off-site.

	It's easy to check the domain of the URL to see if it's on the
	same site.  There are other problems in that links can be
	circular and aliased, e.g., "http://foo.com/" is the same as
	"http://foo.com/index.html" (usually) so you'd have to perform
	some sort of fuzzy URL matching in order to eliminate
	duplicates.
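
	Just to make that concrete, here is a rough sketch in C++ of the
	kind of canonicalization I mean.  It is not taken from SWISH++;
	canonical_url and already_seen are made-up names, and a real
	crawler has to handle far more aliasing cases than this:

#include <cctype>
#include <set>
#include <string>

// Sketch only: reduce a URL to a canonical form so that
// "http://FOO.com", "http://foo.com/" and "http://foo.com/index.html"
// all compare equal.  Assumes the URL begins with "http://".
std::string canonical_url( std::string url ) {
    std::string::size_type path = url.find( '/', 7 );   // first '/' after "http://"
    if ( path == std::string::npos ) {
        url += '/';                                      // "http://foo.com" -> ".../"
        path = url.size() - 1;
    }
    for ( std::string::size_type i = 0; i < path; ++i )  // scheme and host are
        url[ i ] = std::tolower( (unsigned char)url[ i ] );  // case-insensitive
    std::string const index_name = "index.html";
    if ( url.size() >= index_name.size() &&
         url.compare( url.size() - index_name.size(),
                      index_name.size(), index_name ) == 0 )
        url.erase( url.size() - index_name.size() );     // ".../index.html" -> ".../"
    return url;
}

// Remember the canonical form of every URL already seen so that
// circular or aliased links aren't indexed twice.
std::set<std::string> seen_urls;

bool already_seen( std::string const &url ) {
    return !seen_urls.insert( canonical_url( url ) ).second;
}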

> In the context of indexing, "visiting" a node (page) would correspond
> to indexing the contents of that page.  Certainly no one with any sense
> whatever would insist that all links from a page must be followed
> before the words of the page can be added to the index.  In all
> probability, the entire page will be read and indexed before any of
> the links are followed.  Any sensible indexing program will index the
> contents of Home and then follow links to A, B, C and D rather than
> waiting to index Home until after all of A, B, C and D (and everything
> they link to) are indexed.  I therefore take issue with your
> characterization of depth-first as never completing a page because of
> following links first.

	You can do as you please; however, it's very easy to imagine
	that an implementation will follow a link upon encountering it.
	It is the "easy" solution since you don't have to queue it.
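
	Just as a sketch (this is not SWISH++ code; extract_links and
	index_page are hypothetical stand-ins), the difference looks
	roughly like this:

#include <queue>
#include <set>
#include <string>
#include <vector>

// Stand-ins only; a real crawler fetches and parses the page.
std::vector<std::string> extract_links( std::string const & ) {
    return std::vector<std::string>();
}
void index_page( std::string const & ) {
    // a real indexer reads the page and adds its words to the index
}

std::set<std::string> visited;

// The "easy" depth-first version: follow each link the moment it is
// encountered; no queue is needed because the call stack holds the state.
void crawl_depth_first( std::string const &url ) {
    if ( !visited.insert( url ).second )
        return;                                    // already seen
    index_page( url );
    std::vector<std::string> links = extract_links( url );
    for ( std::vector<std::string>::size_type i = 0; i < links.size(); ++i )
        crawl_depth_first( links[ i ] );           // recurse immediately
}

// The breadth-first version: queue links as they are found and do not
// follow any of them until every page at the current level is indexed.
void crawl_breadth_first( std::string const &start ) {
    std::queue<std::string> q;
    visited.insert( start );
    q.push( start );
    while ( !q.empty() ) {
        std::string url = q.front();
        q.pop();
        index_page( url );
        std::vector<std::string> links = extract_links( url );
        for ( std::vector<std::string>::size_type i = 0; i < links.size(); ++i )
            if ( visited.insert( links[ i ] ).second )
                q.push( links[ i ] );              // defer instead of following now
    }
}

	Either way every page still gets indexed; the difference is only
	in when the links are followed.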

> Suppose Altavista (or swish++) is currently indexing at level 'n' of
> its search.  Level 'n' consists of nodes 3,123,456-3,654,321 and while
> indexing node 3,245,678 it finds a link to your page at
> http://www.best.com/~pjl/software.html.  
> 
> It records that url as node 6,743,822, but, of course, it does not yet
> actually index it, since it is part of level 'n+1' rather than level
> 'n'.

	Then it's not depth-first.  I'll answer questions on SWISH++,
	but I don't see it as my (or anyone's) task to be a substitute
	for an algorithms course or textbook.

> Does my explanation above provide any insight regarding why I (still) have
> trouble understanding why you regard breadth-first as more suitable?

	Not really.

> I also fail to see the relevance of an algorithms text here.  You say below
> that efficiency is not the reason.  Most algorithms texts of my acquaintance
> concern themselves primarily with efficiency.

	To me it seems like you don't understand the basic algorithm.

> My question had to do with *why* you felt that breadth-first was better.

	And I stated so in as clear English as I can.  I can say it
	again if you like.

> Just as there is a simple explanation for *why* any of several
> different sorting algorithms beat the conceptually simpler bubble sort
> (and each other depending on circumstances), I was assuming that there
> was a corresponding explanation for why swish++ is faster than
> swish-e.

	There is: read the README.

> Usually, when you get an order-of-magnitude improvement, it is because of a
> better, faster algorithm, not just from a better implementation of the same
> algorithm.

	Absolutely wrong.  Please, please, consult an algorithms text.
	An order of magnitude improvement (or even thousands of orders
	of magnitude improvement) has *NOTHING* to do with algorithms
	because such improvements are *CONSTANT* *FACTORS* in the
	polynomial of the running time.  The order of 10n is O(n).

	There are several factors that contribute to the speed
	improvement of SWISH++ (and being breadth-first vs. depth-first
	has *NOTHING* to do with it, nor did I ever say it did).  SWISH++
	essentially uses exactly the same algorithm for indexing words;
	however, it does use a better data structure for storing the
	word index.

	I could be wrong, but it appears that SWISH-E uses an
	unbalanced binary tree.  The worst case running time for
	inserts and lookups is O(n) because the degenerate case for
	such a tree is a linked list.  SWISH++ uses a balanced binary
	tree.  The GNU STL implementation of the map class uses a
	red-black tree (but an AVL tree would do nicely as well).
	There, the worst case running time for inserts and lookups is
	O(lg n) -- BIG DIFFERENCE!

	So while *I* personally am not using a different algorithm
	overall, the implementation of the data structure I'm using "off
	the shelf" is far superior.

	Also, my "order of magnitude" claim is based on a few test
	cases.  I encourage others to compare SWISH-E and SWISH++ and
	share their results.

> I incorrectly assumed that the algorithm mentioned in your message to the
> mailing list was the algorithm that was responsible for the improvement.

	It's only partly responsible, but for large data sets it may
	turn out to be the dominant reason, even though I never stated so.

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Fri Mar 6 09:42:00 1998