On Fri, Mar 09, 2007 at 12:46:52PM +0000, Darrell Berry wrote:
> Hi -- is there a standard way to just get the *link structure*
> (rather than content index) of a site using the swish-e tools
> (spider.pl i guess)?
>
> all i want from the output of my crawl is something like
>
> www.domain.tld -> www.domain.tld/help
> www.domain.tld -> www.domain.tld/info
> www.domain.tld/info -> www.domain.tld/info2
> www.domain.tld/info -> www.domain.tld
>
> ie just spidering the whole domain and showing which pages link to
> which, recursively -- no content, no indexing...? i can find similar
> questions in the archives, but not a definitive answer -- all help
> appreciated

Try printing the URL passed to the test_url() callback. Then dump the
$server parameter passed to it to see whether the parent URL (the page
where the URL was found) is listed. If not, modify check_link() and store
$base in the $server hash:

    $server->{parent} = $base;

Then, in your test_url() function, print out parent => url.
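
As a rough sketch of that last step, a spider config file might look something like the fragment below. It assumes check_link() has already been patched to set $server->{parent} as described above; the "parent" key is not something spider.pl provides on its own. The base_url and email values are placeholders, and writing to STDERR is just one choice, picked so the link list stays separate from whatever the spider writes to STDOUT.

```perl
# SwishSpiderConfig.pl -- illustrative sketch only, not a tested config.
# Assumes check_link() was modified to do:  $server->{parent} = $base;

@servers = (
    {
        base_url => 'http://www.domain.tld/',   # placeholder start URL
        email    => 'you@domain.tld',           # placeholder contact address

        # Called for each URL the spider considers following.
        test_url => sub {
            my ( $uri, $server ) = @_;

            # 'parent' only exists if check_link() was patched (assumption).
            my $parent = defined $server->{parent}
                ? $server->{parent}
                : '(unknown)';

            # One "parent -> url" line per link found.
            print STDERR "$parent -> $uri\n";

            return 1;    # true means: go ahead and follow this URL
        },
    },
);

1;
```

To check what $server actually contains before patching anything, a temporary "use Data::Dumper; print STDERR Dumper($server);" inside the callback should show whether a parent URL is already there.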
--
Bill Moseley
moseley@hank.org
Received on Fri Mar 9 09:42:13 2007