On Thu, 5 Mar 1998 00prneubauer@bsuvc.bsu.edu wrote:
> Suppose we select an arbitrary page as our root -- call it "Home".
> Home has links to A, B, C and D, so we record those links and then
> follow them in sequence. We find that A has links to i, ii, iii and
> iv, that B has links to v, vi, vii and viii, etc. Page i then has
> links to a, b, c and ii links to d, e, f and g, etc. But some of
> those links (assume iv) may be off-site.
It's easy to check the domain of the URL to see if it's on the
same site. There are other problems in that links can be
circular and aliased, e.g., "http://foo.com/" is the same as
"http://foo.com/index.html" (usually) so you'd have to perform
some sort of fuzzy URL matching in order to eliminate
duplicates.
> In the context of indexing, "visiting" a node (page) would correspond
> to indexing the contents of that page. Certainly noone with any sense
> whatever would insist that all links from a page must be followed
> before the words of the page can be added to the index. In all
> probability, the entire page will be read and indexed before any of
> the links are followed. Any sensible indexing program will index the
> contents of Home and then follow links to A, B, C and D rather than
> waiting to index Home until after all of A, B, C and D (and everything
> they link to) are indexed. I therefore take issue with your
> characterization of depth-first as never completing a page because of
> following links first.
You can do as you please; however, it's very easy to imagine
that an implementation will follow a link upon encountering it.
It is the "easy" solution since you don't have to queue it.
> Suppose Altavista (or swish++) is currently indexing at level 'n' of
> its search. Level 'n' consists of nodes 3,123,456-3,654,321 and while
> indexing node 3,245,678 it finds a link to your page at
> http://www.best.com/~pjl/software.html.
>
> It records that url as node 6,743,822, but, of course, it does not yet
> actually index it, since it is part of level 'n+1' rather than level
> 'n'.
Then it's not depth-first. I'll answer questions on SWISH++,
but I don't see it as my (or anyone's task) to be a substitute
for an algorithms course or textbook.
> Does my explanation above provide any insight regarding why I (still) have
> trouble understanding why you regard breadth-first as more suitable?
Not really.
> I also fail to see the relevance of an algorithms text here. You say below
> that efficiency is not the reason. Most algorithms texts of my acquaintence
> concern themselves primarily with efficiency.
To me it seems like you don't understand the basic algorithm.
> My question had to do with *why* you felt that breadth-first was better.
And I stated so in as clear English as I can. I can say it
again if you like.
> Just as there is a simple explanation for *why* any of several
> different sorting algorithms beat the conceptually simpler bubble sort
> (and each other depending on circumstances), I was assuming that there
> was a corresponding explanation for why swish++ is faster than
> swish-e.
There is: read the README.
> Usually, when you get an order-of-magnitude improvement, it is because of a
> better, faster algorithm, not just from a better implementation of the same
> algorithm.
Absolutely wrong. Please, please, consult an algorithms text.
An order of magnitude improvment (or even thousands of orders of
magnitude improvement) has *NOTHING* to do with algorithms
because such are *CONSTANT* *FACTORS* in the polynomial of the
running time. The order of 10n is O(n).
There are several factors that contribute to the speed
improvement of SWISH++ (and being breadth-first vs. depth-first
has *NOTHING* to do with it nor did I ever say it did). SWISH++
essentially uses exactly the same algorithm for indexing words;
however, it does use a better data structure for storing the
word index.
I could be wrong, but it appears that SWISH-E uses an
unbalanced binary tree. The worst case running time for
inserts and lookups is O(n) because the degenerate case for
such a tree is a linked list. SWISH++ uses a balanced binary
tree. The GNU STL implementation of the map class uses a
red-black tree (but an AVL tree would do nicely as well).
There, the worst case running time for inserts and lookups is
O(lg n) -- BIG DIFFERENCE!
So while *I* personally am not using a different algorithm
overall, the implementation of the data structure I'm using "off
the shelf" is far superior.
Also, my "order of magnitude" claim is based on a few test
cases. I encourage others to compare SWISH-E and SWISH++ and
share their results.
> I incorrectly assumed that the algorithm mentioned in your message to the
> mailing list was the algorithm that was responsible for the improvement.
It's only partly responsible, but for large data sets, may turn
out to be the dominant reason even though I never stated so.
- Paul J. Lucas
NASA Ames Research Center Caelum Research Corporation
Moffett Field, California San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Fri Mar 6 09:42:00 1998