Re: HTTP method and swishspider

From: David Norris <dave(at)not-real.webaugur.com>
Date: Mon Sep 25 2000 - 06:38:16 GMT
jmruiz@boe.es wrote:
> I have noticed that this option is slow. I am wondering why.

Could that be related to the waiting period between requests?  Some
sites will automatically block IPs that make consecutive requests very
quickly, as a denial-of-service protection.  There should be some way to
change the delay in case you are indexing your own server.  For sites
you don't manage, there is an informal agreement among robot writers to
keep requests below a certain rate; see the Robot Guidelines at
<http://info.webcrawler.com/mak/projects/robots/guidelines.html>.
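For illustration, a fetch loop with a configurable pause might look
like the Perl sketch below.  The SPIDER_DELAY variable name is just an
example (SWISH-E doesn't read it), and the right delay would need
tuning per site.

    #!/usr/bin/perl -w
    # Sketch: fetch each URL given on the command line, pausing a
    # configurable number of seconds between requests.
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    my $delay = $ENV{SPIDER_DELAY} || 60;    # seconds between requests
    my $ua    = LWP::UserAgent->new;
    $ua->agent('example-spider/0.1');

    foreach my $url (@ARGV) {
        my $response = $ua->request(HTTP::Request->new(GET => $url));
        print $url, ' => ', $response->code, "\n";
        sleep $delay unless $url eq $ARGV[-1];   # no pause after last
    }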

> I am wondering if there is a way to avoid the use of  swishspider.

I have looked at a few options in the past.  GNU WGet may be a good
option.  WGet is a very flexible mirroring robot, written in C, which
supports HTTP and FTP.  I attempted to interface WGet in a manner
compatible with swishspider, but I had trouble extracting URLs from the
downloaded file: WGet doesn't dump a list of URLs, although it stores
one in an internal structure.  I think it would be easier to interface
WGet with SWISH-E in a manner different from swishspider's.
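For what it's worth, the link-extraction half of swishspider's job can
be approximated outside WGet.  A rough Perl sketch using
HTML::LinkExtor (the file and base-URL arguments are hypothetical):

    #!/usr/bin/perl -w
    # Sketch: list the href/src URLs found in an already-downloaded
    # HTML file, resolved against a base URL.
    use strict;
    use HTML::LinkExtor;
    use URI;

    my ($file, $base) = @ARGV;    # e.g. index.html http://example.com/

    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attrs) = @_;
        push @links, values %attrs;    # href, src, etc.
    });
    $parser->parse_file($file);

    # Resolve each link against the base and print one per line.
    print URI->new_abs($_, $base), "\n" foreach @links;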

I should have a copy of Mark Gaulin's swishspider written in C; I
don't believe he ever publicly released the source.  You might want to
email him and ask about it, if possible (I'm not sure he's still on
this list).  If I can find the source, I will send you a copy.  It
builds and runs well on Windows, but it needs some porting to Unix.

-- 
,David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Dave's Weather - http://www.webaugur.com/dave/wx
  ICQ Universal Internet Number - 412039
  E-Mail - dave@webaugur.com

"I would never belong to a club that would have me as a member!"
                                          - Groucho Marx
Received on Mon Sep 25 09:38:11 2000