
Re: [swish-e] Regarding scalability and multithreading in Swish-e

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue Feb 19 2008 - 15:13:29 GMT
On 02/19/2008 07:28 AM, kumar.nitin@wipro.com wrote:

> 
> In the above scenario, while crawling *http://learning.wipro.com*, it
> gives me all the links on the page when the option *max_depth=0* is used.
> 
>  
> 
> But with *max_depth=1*, when it tries to connect to a different
> host such as *http://channelw.wipro.com*, it fails.
> 
>  
> 
> Please let us know how we can resolve this problem so that at depth 1
> we can achieve our functionality.

http://swish-e.org/docs/spider.html#max_depth

max_depth should not affect whether the spider follows links to hosts other than the one
you have specified as the base. You don't have channelw.wipro.com anywhere in your config,
so the spider ignores it.
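
For reference, one way to make the spider follow links onto a second hostname is the same_hosts key in the spider config. The sketch below is illustrative, not taken from your config — the filename, email address, and option values are assumptions; see the spider.pl docs linked above for the full list of keys:

```perl
# Hypothetical SwishSpiderConfig.pl fragment.
# same_hosts lists additional hostnames that spider.pl should treat
# as part of the same site as base_url, so links to them are followed.
@servers = (
    {
        base_url   => 'http://learning.wipro.com/',
        same_hosts => [ 'channelw.wipro.com' ],  # also follow links to this host
        max_depth  => 1,
        email      => 'admin@example.com',       # contact address sent in the agent header
    },
);
1;
```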

If you can send a small, reproducible test case, we can try to find the problem. In my
experience, putting a test case like that together usually reveals what I'm doing wrong.

> If I am crawling multiple URLs at a time, how can it balance the load,
> e.g. with *multithreading*?
> 

spider.pl does not do threading or parallel fetches of any kind. It's all serial. Run
multiple spider.pl instances, one for each site, if you need that kind of feature.
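
A minimal shell sketch of that approach, assuming one config file per site. The config filenames are placeholders, and a stub function stands in for ./spider.pl so the sketch runs as-is — substitute the real script and your own configs:

```shell
#!/bin/sh
# Run one crawl per site in parallel, one process per config file.
# spider() is a stand-in for invoking ./spider.pl with a config file.
spider() { echo "crawling with config $1"; }

for cfg in siteA.conf siteB.conf; do
    spider "$cfg" &    # background each crawl so they run concurrently
done
wait                   # block until every backgrounded crawl finishes
echo "all crawls done"
```

Since each instance is an independent process, each site also gets its own output; you would merge the resulting indexes afterwards if you want to search them together.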

-- 
Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue Feb 19 10:13:30 2008