Re: on spidering

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Oct 27 2003 - 04:37:47 GMT
On Sun, Oct 26, 2003 at 05:00:03PM -0800, Tech Support FaenWorks wrote:
> the deal is:: shouldn't it auto-spider to html sites that the original page
> is mentioning?  I need it to be able to spider to all of the members of a
> webring lets say.  we start it on the first server... it has a link to two
> sites... it should index the two child sites.... what do I need to do to do
> that?

How is it going to know when to stop?  It's called a "web" after all.

With spider.pl you can define more than one site to index at a time.
There's also a "same_hosts" configuration option, but that's for the
case where a single site links to itself as both www.example.com and
example.com, and both names point to the same pages.

You can't use same_hosts for a webring because it will rewrite all your
links to the initial host.  It does that so that it can track which pages
it has already seen.  You don't really want to index
www.example.com/index.html and example.com/index.html as two different
pages if they are the same.
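To make the rewriting idea concrete, here is a rough sketch in Python. It is not spider.pl's code (spider.pl is Perl); the names rewrite_host, first_visit and seen are made up for illustration, and it assumes plain URLs with no port or userinfo in them.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_host(url, base_host, same_hosts):
    """Return url with its host swapped for base_host when the host is a listed alias."""
    parts = urlsplit(url)
    if parts.hostname in same_hosts:
        # Assumption: no port or userinfo, so replacing the whole netloc is safe.
        parts = parts._replace(netloc=base_host)
    return urlunsplit(parts)

# Hypothetical "pages already seen" store, keyed by the rewritten URL.
seen = set()

def first_visit(url, base_host="www.example.com", same_hosts=("example.com",)):
    """True the first time a page is seen, regardless of which alias the link used."""
    canonical = rewrite_host(url, base_host, same_hosts)
    if canonical in seen:
        return False
    seen.add(canonical)
    return True
```

Because every alias collapses to the initial host before the lookup, the two index.html links above count as one page. It also shows why this is no use for a webring: the member sites really are different hosts, and rewriting them to the first one would point the spider at pages that don't exist there.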

spider.pl could probably still track those pages but not rewrite the
host name (e.g. www.example.com -> example.com), but that's not the way
it currently works.
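A sketch of that alternative, again in Python rather than spider.pl's Perl and with made-up names (seen_key, crawl_order, aliases): the alias is folded only into the key used for the "seen" check, and the URL that gets kept is the one that was actually linked. It assumes URLs without ports or userinfo.

```python
from urllib.parse import urlsplit, urlunsplit

def seen_key(url, aliases):
    """Build the key used only for duplicate detection; the URL itself is untouched."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # aliases maps an alias host to the host it should be treated as.
    host = aliases.get(host, host)
    return urlunsplit(parts._replace(netloc=host))

def crawl_order(urls, aliases):
    """Return the URLs that would be fetched, dropping alias duplicates."""
    seen = set()
    kept = []
    for url in urls:
        key = seen_key(url, aliases)
        if key not in seen:
            seen.add(key)
            kept.append(url)  # original host name preserved
    return kept
```

For example, with aliases = {"www.example.com": "example.com"}, the list http://www.example.com/a, http://example.com/a, http://example.com/b would be reduced to the first and third entries, each still carrying the host name it was linked with.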

If you want to change the behaviour look at the check_link() function in
spider.pl.
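
As a hint at the kind of test such a function could make, here is a hypothetical sketch in Python. It is not the real check_link() from spider.pl, and allow_link and allowed_hosts are invented names. The idea is that an explicit list of member hosts is what tells the spider when to stop.

```python
from urllib.parse import urlsplit

def allow_link(url, allowed_hosts):
    """Follow a link only if it is http(s) and its host is on the explicit list."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts
```

With allowed_hosts set to the webring members, links between members are followed and everything else is skipped, so the crawl stays bounded.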


-- 
Bill Moseley
moseley@hank.org
Received on Mon Oct 27 04:50:01 2003