
Re: [swish-e] just extracting link structure, not indexing content

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Mar 09 2007 - 14:46:08 GMT
On Fri, Mar 09, 2007 at 12:46:52PM +0000, Darrell Berry wrote:
> Hi -- is there a standard way to just get the *link structure*  
> (rather than content index) of a site using the swish-e tools  
> (spider.pl i guess)?
> 
> all i want from the output of my crawl is something like
> 
> www.domain.tld -> www.domain.tld/help
> www.domain.tld -> www.domain.tld/info
> www.domain.tld/info -> www.domain.tld/info2
> www.domain.tld/info -> www.domain.tld
> 
> ie just spidering the whole domain and showing which pages link to  
> which, recursively -- no content, no indexing...? i can find similar  
> questions in the archives, but not a definitive answer -- all help  
> appreciated

Try printing the url passed to the test_url() callback.  Then dump the
$server parameter to see if the parent url (the page where the url was
found) is listed.  If not, modify check_link() and stuff $base into the
$server hash.

    $server->{parent} = $base;

Then in your test_url() function print out the parent => url.
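
Putting the pieces together, a config file along these lines should do
it.  This is only a sketch: the test_url()/test_response() callback
signatures and the no_index flag follow spider.pl's documented
interface, but the {parent} key is the assumption above -- it only
exists if you patch check_link() as described, and base_url/email are
placeholder values.

```perl
# Hypothetical SwishSpiderConfig.pl sketch -- prints link edges,
# indexes nothing.
use strict;
use warnings;

our @servers = (
    {
        base_url => 'http://www.domain.tld/',
        email    => 'admin(at)domain.tld',

        # Print one "parent -> url" edge per link found, then let
        # the spider follow it.  $server->{parent} is the value
        # stuffed in by the check_link() patch above.
        test_url => sub {
            my ( $uri, $server ) = @_;
            my $parent = $server->{parent} || '(start)';
            print "$parent -> $uri\n";
            return 1;    # keep spidering
        },

        # Tell spider.pl not to hand this page's content to the
        # indexer, while still extracting its links.
        test_response => sub {
            my ( $uri, $server, $response ) = @_;
            $server->{no_index} = 1;
            return 1;
        },
    },
);

1;
```

Run it with "spider.pl SwishSpiderConfig.pl > /dev/null" and the edge
list shows up on stdout mixed with spider.pl's own output, so you may
want print STDERR instead and redirect that.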

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs

Received on Fri Mar 9 09:42:13 2007