
Re: [swish-e] Regarding scalability and multithreading in Swish-e

From: <kumar.nitin(at)not-real.wipro.com>
Date: Tue Feb 19 2008 - 17:52:21 GMT
Hi,

 

Please find attached the config file (test.config) and its test output
(output1_depth_local.txt).

In output1_depth_local.txt, if you search for the link
http://channelw.wipro.com (one of the links), you will see that the
spider does not connect to it or download the page, even though we have
set max_depth=1.

Below is the exact error from output1_depth_local.txt:

____________________________________________________________________

 

Looking at extracted tag '<a href="http://channelw.wipro.com">'

check_link:http://channelw.wipro.com

 ?? <a href="http://channelw.wipro.com"> skipped because different host

 ?? <a href="http://channelw.wipro.com"> skipped because different host

  tag did not include any links to follow or is a duplicate

 

_______________________________________________________________________

 

Please share your input on this and let us know the resolution.

 

 

With Regards,

Nitin Kumar

+91-9999499757

 

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
Sent: Tuesday, February 19, 2008 8:43 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Regarding scalability and multithreading in
Swish-e

 

 

 

On 02/19/2008 07:28 AM, kumar.nitin@wipro.com wrote:

 

> In the above scenario, while crawling 'http://learning.wipro.com', it
> gives me all the links on the page when using max_depth=0.
>
> But with max_depth=1, when it tries to connect to a different host
> such as http://channelw.wipro.com, it fails.
>
> Please let us know how we can resolve this problem so that at depth 1
> we can achieve our functionality.
 

http://swish-e.org/docs/spider.html#max_depth

 

max_depth should not affect whether the spider follows links to hosts
other than the one you have specified as the base. You don't have
channelw.wipro.com anywhere in your config, so the spider ignores it.
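
For context, the usual way to let spider.pl follow links to an extra
hostname is the same_hosts option in the spider config. The attached
test.config is not reproduced in this thread, so the following is only a
sketch of what such a config might look like, not the poster's actual
file:

```perl
# Sketch of a spider.pl server config (SwishSpiderConfig.pl style).
# base_url, same_hosts, max_depth and email are standard spider.pl
# options; the concrete values here are assumptions for illustration.
@servers = (
    {
        base_url   => 'http://learning.wipro.com/',
        # Hosts listed here are treated as the same site, so links to
        # them are followed instead of "skipped because different host".
        same_hosts => [ 'channelw.wipro.com' ],
        max_depth  => 1,
        email      => 'kumar.nitin@wipro.com',
    },
);

1;  # config file must return true
```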

 

If you can send a small, reproducible test case, we can try to find the
problem. IME, putting a test case like that together will usually
reveal what I'm doing wrong.

 

> If I am crawling multiple URLs at a time, how can it balance the
> load, as with multithreading?
 

spider.pl does not do threading or parallel fetches of any kind. It's
all serial. Run multiple spider.pl instances, one for each site, if you
need that kind of feature.
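
Since spider.pl fetches strictly serially, parallelism has to come from
running several processes. One hypothetical way to do that is a small
fork wrapper; the config and output file names below are made up for
illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper: launch one spider.pl process per site config in
# parallel, since spider.pl itself has no threading of any kind.
my @configs = ( 'learning.config', 'channelw.config' );  # assumed names

my @pids;
for my $config (@configs) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        # child: capture the spider output for later indexing
        open STDOUT, '>', "$config.out" or die "open failed: $!";
        exec './spider.pl', $config or die "exec failed: $!";
    }
    push @pids, $pid;
}

# wait for every crawl to finish before indexing the results
waitpid( $_, 0 ) for @pids;
```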

 

-- 

Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/

 

_______________________________________________

Users mailing list

Users@lists.swish-e.org

http://lists.swish-e.org/listinfo/users





Received on Tue Feb 19 12:52:44 2008