On Feb 19, 2008, at 9:52 AM, <kumar.nitin@wipro.com> wrote:
> Hi,
>
>
>
> Please find the attached config file (test.config) and test output
> (output1_depth_local.txt) of the same.
>
>
>
> In output1_depth_local.txt:
>
>
>
> If you search for the link http://channelw.wipro.com (one of the
> links), the spider is not connecting to it or downloading the page
> when we set max_depth=1.
>
>
>
> Below is the exact error coming in file output1_depth_local.txt:
>
> ____________________________________________________________________
>
>
>
> Looking at extracted tag '<a href="http://channelw.wipro.com">'
>
> check_link:http://channelw.wipro.com
>
> ?? <a href="http://channelw.wipro.com"> skipped because different
> host
>
> ?? <a href="http://channelw.wipro.com"> skipped because different
> host
>
> tag did not include any links to follow or is a duplicate
>
>
>
That output is telling you that it is a "different host": www.wipro.com !=
channelw.wipro.com.
The spider does not follow links to different hosts (see the docs).
max_depth applies only to links on the host being spidered.
Bill
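
The rule described above can be sketched in a few lines. This is only an
illustration, not spider.pl's actual code: the function name should_follow,
its arguments, and the way depth is counted are all assumptions made for the
example. It shows that the host comparison happens regardless of max_depth,
so raising the depth never makes an off-host link eligible.

```python
from urllib.parse import urljoin, urlsplit


def should_follow(base_url, page_url, href, depth, max_depth):
    """Hypothetical link filter; returns (follow, reason).

    depth is the depth of the page the link was found on (assumed to
    start at 0 for the base URL); the linked page would be depth + 1.
    """
    # Resolve relative links against the page they were found on.
    target = urljoin(page_url, href)

    # The host check comes first and does not depend on max_depth.
    if urlsplit(target).hostname != urlsplit(base_url).hostname:
        return False, "skipped because different host"

    # Only same-host links are subject to the depth limit.
    if max_depth is not None and depth + 1 > max_depth:
        return False, "skipped because max_depth exceeded"

    return True, "ok"


base = "http://www.wipro.com/"
print(should_follow(base, base, "http://channelw.wipro.com", 0, 1))
print(should_follow(base, base, "/about.html", 0, 1))
```

Under these assumptions the first call is rejected on the host check alone.
Getting pages from channelw.wipro.com would mean listing that host in the
spider config as well, as noted further down in the quoted reply.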
> ______________________________________________________________________
>
>
>
> Please give us your input on this and let us know the
> resolution.
>
>
>
>
>
> With Regards,
>
> Nitin Kumar
>
> +91-9999499757
>
>
>
> -----Original Message-----
> From: users-bounces@lists.swish-e.org [mailto:users-
> bounces@lists.swish-e.org] On Behalf Of Peter Karman
> Sent: Tuesday, February 19, 2008 8:43 PM
> To: Swish-e Users Discussion List
> Subject: Re: [swish-e] Regarding scalibilty and multithreading in
> Swish-e
>
>
>
>
>
>
>
> On 02/19/2008 07:28 AM, kumar.nitin@wipro.com wrote:
>
>
>
> >
>
> > In the above scenario, while crawling *'http://learning.wipro.com'*, it
>
> > gives me all links at page in case of option used *max_depth=0.*
>
> >
>
> >
>
> >
>
> > But in case of *max_depth=1*, when it is trying to connect to a different
>
> > host like *http://channelw.wipro.com*, it is failing.
>
> >
>
> >
>
> >
>
> > Please let us know how we can resolve this problem so that at
> depth-1,
>
> > we can achieve our functionality.
>
>
>
> http://swish-e.org/docs/spider.html#max_depth
>
>
>
> max_depth should not affect whether the spider follows links to
> hosts other than the one
>
> you have specified as the base. You don't have channelw.wipro.com
> anywhere in your config,
>
> so the spider ignores it.
>
>
>
> If you can send a small, reproducible test case, we can try and
> find the problem. IME,
>
> putting a test case like that together will usually reveal what I'm
> doing wrong.
>
>
>
> > If I am crawling multiple URLs at a time, how can it balance the
> load?
>
> > Like *multithreading*.
>
> >
>
>
>
> spider.pl does not do threading or parallel fetches of any kind.
> It's all serial. Run
>
> multiple spider.pl instances, one for each site, if you need that
> kind of feature.
>
>
>
> --
>
> Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
>
>
>
> _______________________________________________
>
> Users mailing list
>
> Users@lists.swish-e.org
>
> http://lists.swish-e.org/listinfo/users
>
>
> <output1_depth_local.txt><test.config>
Received on Tue Feb 19 13:50:58 2008