
Re: spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 16 2006 - 19:53:06 GMT
On Wed, Aug 16, 2006 at 12:45:31PM -0700, Z wrote:
> From a command prompt I tried this:
>
>   >spider.pl default http://www.swish-e.org/ > output.txt
>
> The result was:
>   E:\INETPUB\WWWROOT\SITE\WINDOWS\spider.pl: Reading parameters from 'default'
>   Summary for: http://www.swish-e.org/
>   Connection: Close: 1 (0.0/sec)
>           Unique URLs: 1 (0.0/sec)

Did you read about debugging?

$ perldoc /usr/local/lib/swish-e/spider.pl | grep DEBUG                             
           DEBUG_SKIPPED debug flag is set.
           SPIDER_DEBUG when running spider.pl.  You can specify any of the above
               SPIDER_DEBUG=url,links spider.pl [....]
               debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,
           DEBUG_* constants.  The string is converted to a number only at the
           Now you can use the words instead of or'ing the DEBUG_* constants
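
If you don't want to set the environment variable every run, the same
flags can go in the "debug" option of a spider config file, and you point
spider.pl at that file instead of 'default'. A minimal sketch -- the
base_url and email values are placeholders, adjust for your site:

@servers = (
    {
        base_url => 'http://www.swish-e.org/',  # placeholder start URL
        email    => 'admin@example.com',        # placeholder contact address

        # String form, converted to the DEBUG_* constants at run time:
        debug    => 'url,links,failed,headers',

        # Equivalent to or'ing the constants directly:
        # debug  => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,
    },
);

Then run it as: spider.pl SwishSpiderConfig.pl (the file name is just a
convention).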

This is for a file forbidden by robots.txt:

$ SPIDER_DEBUG=failed,url,links,headers perl /usr/local/lib/swish-e/spider.pl default http://www.swish-e.org/who.html
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://www.swish-e.org/who.html --

vvvvvvvvvvvvvvvv HEADERS for http://www.swish-e.org/who.html vvvvvvvvvvvvvvvvvvvvv

---- Request ------
HEAD http://www.swish-e.org/who.html
Accept-Encoding: gzip; deflate


---- Response ---
Status: 403 Forbidden by robots.txt

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^


Summary for: http://www.swish-e.org/who.html
Connection: Close: 1  (1.0/sec)
      Unique URLs: 1  (1.0/sec)
       robots.txt: 1  (1.0/sec)
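
Note that the 403 above is generated on the client side, not by the web
server: LWP's robot support produces that exact "403 Forbidden by
robots.txt" response when a URL is disallowed. A rough standalone version
of the same check, using WWW::RobotRules (a sketch, not spider.pl's
actual code; the agent name is made up):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $url        = 'http://www.swish-e.org/who.html';
my $robots_url = 'http://www.swish-e.org/robots.txt';

# Fetch the site's robots.txt and parse it for our agent name.
my $rules = WWW::RobotRules->new('swish-e spider');
my $txt   = get($robots_url);
$rules->parse($robots_url, $txt) if defined $txt;

# The same allow/deny test applied before each page request.
print $rules->allowed($url)
    ? "allowed: $url\n"
    : "forbidden by robots.txt: $url\n";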



-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu