Re: [swish-e] Regarding scalibilty and multithreading in Swish-e

From: William M Conlon <bill(at)not-real.tothept.com>
Date: Tue Feb 19 2008 - 18:50:53 GMT
On Feb 19, 2008, at 9:52 AM, <kumar.nitin@wipro.com> wrote:

> Hi,
>
>
>
> Please find the attached config file (test.config) and test output  
> (output1_depth_local.txt) of the same.
>
>
>
> In output1_depth_local.txt:
>
>
>
> If you search for the link http://channelw.wipro.com (one of the links),
> the spider is not connecting to it or downloading the page with
> max_depth=1 set.
>
>
>
> Below is the exact error coming in file output1_depth_local.txt:
>
> ____________________________________________________________________
>
>
>
> Looking at extracted tag '<a href="http://channelw.wipro.com">'
>
> check_link:http://channelw.wipro.com
>
>  ?? <a href="http://channelw.wipro.com"> skipped because different host
>
>  ?? <a href="http://channelw.wipro.com"> skipped because different host
>
>   tag did not include any links to follow or is a duplicate
>
>
>

The output is telling you that it is a "different host": www.wipro.com != channelw.wipro.com

The spider does not follow links to different hosts (see the docs).
max_depth applies only to the host being spidered.

Bill
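
To illustrate the check described above, here is a minimal Python sketch of a host comparison. It is not spider.pl's actual code (spider.pl is written in Perl); the function name same_host and the example page path are assumptions made for illustration. It only shows why a link to channelw.wipro.com counts as a different host when the base is www.wipro.com.

```python
from urllib.parse import urlsplit


def same_host(base_url: str, link: str) -> bool:
    """Return True when both URLs name the same host.

    urlsplit(...).hostname is already lowercased, so the comparison
    is case-insensitive. A URL without a host never matches.
    """
    base_host = urlsplit(base_url).hostname
    link_host = urlsplit(link).hostname
    return base_host is not None and base_host == link_host


# Hypothetical base URL and links, modelled on the log excerpt in this thread.
base = "http://www.wipro.com/"
for link in ("http://www.wipro.com/about.html", "http://channelw.wipro.com"):
    verdict = "follow" if same_host(base, link) else "skipped because different host"
    print(f"{link}: {verdict}")
```

Under this comparison no max_depth value changes the result, because the host test is applied to each link independently of how deep the crawl has gone.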




> _______________________________________________________________________
>
>
>
> Please give us your input on this and let us know the
> resolution.
>
>
>
>
>
> With Regards,
>
> Nitin Kumar
>
> +91-9999499757
>
>
>
> -----Original Message-----
> From: users-bounces@lists.swish-e.org [mailto:users-bounces@lists.swish-e.org]
> On Behalf Of Peter Karman
> Sent: Tuesday, February 19, 2008 8:43 PM
> To: Swish-e Users Discussion List
> Subject: Re: [swish-e] Regarding scalibilty and multithreading in  
> Swish-e
>
>
>
>
>
>
>
> On 02/19/2008 07:28 AM, kumar.nitin@wipro.com wrote:
>
>
>
> >
>
> > In the above scenario, while crawling *'http://learning.wipro.com'*, it
>
> > gives me all the links on the page when *max_depth=0* is used.
>
> >
>
> >
>
> >
>
> > But with *max_depth=1*, when it tries to connect to a different
>
> > host such as *http://channelw.wipro.com*, it
>
> > is failing.
>
> >
>
> >
>
> >
>
> > Please let us know how we can resolve this problem so that at depth-1,
>
> > we can achieve our functionality.
>
>
>
> http://swish-e.org/docs/spider.html#max_depth
>
>
>
> max_depth should not affect whether the spider follows links to hosts other than the one
> you have specified as the base. You don't have channelw.wipro.com anywhere in your config,
> so the spider ignores it.
>
>
>
> If you can send a small, reproducible test case, we can try to find the problem. IME,
> putting a test case like that together will usually reveal what I'm doing wrong.
>
>
>
> > If I am crawling multiple URLs at a time, how can it balance the load?
>
> > Like *multithreading*.
>
> >
>
>
>
> spider.pl does not do threading or parallel fetches of any kind. It's all serial. Run
> multiple spider.pl instances, one for each site, if you need that kind of feature.
>
>
>
> --
>
> Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/
>
>
>
> _______________________________________________
>
> Users mailing list
>
> Users@lists.swish-e.org
>
> http://lists.swish-e.org/listinfo/users
>
> <output1_depth_local.txt><test.config>

Received on Tue Feb 19 13:50:58 2008
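
Peter's suggestion quoted above, running one spider.pl instance per site, could be driven by a small launcher along these lines. This is only a sketch under stated assumptions: the helper names spider_command and run_in_parallel are hypothetical, and the invocation form "perl spider.pl <config>" with one config file per site is assumed rather than taken from this thread, so check it against the spider documentation. Each instance writes to its own output file so that the streams do not interleave.

```python
import subprocess


def spider_command(config_path, spider="spider.pl"):
    """Build the argv for one spider run (assumed form: perl spider.pl <config>)."""
    return ["perl", spider, config_path]


def run_in_parallel(jobs):
    """Start every job at once, then wait for all of them.

    jobs is a list of (argv, output_path) pairs. Each process gets its
    own output file for stdout. Returns the exit codes in job order.
    """
    running = []
    for argv, output_path in jobs:
        out = open(output_path, "wb")
        running.append((subprocess.Popen(argv, stdout=out), out))

    exit_codes = []
    for proc, out in running:
        exit_codes.append(proc.wait())
        out.close()
    return exit_codes
```

A caller would pass one pair per site, for example (spider_command("learning.config"), "learning.out") and (spider_command("channelw.config"), "channelw.out"), where both file names are placeholders. Separate processes are used instead of threads because the spider itself is serial, as noted in the thread.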