
RE: HTTP Crawler

From: Hsiao Ketung Contr 61 CS/SCBN <KETUNG.HSIAO(at)not-real.LOSANGELES.AF.MIL>
Date: Thu May 02 2002 - 18:24:07 GMT
Hi, folks,

I've just blanked out robots.txt in the root directory of my intranet server and
tried ./swishspider again, and I get a 500 in the .response output:

(Internal Error 500
 The server encountered an unexpected condition which prevented it from
 fulfilling the request.)

I think it's because I'm running swishspider from our internet server, which is
outside the firewall, and of course I can't get through the firewall to our
intranet. I have a feeling there is no way around that.
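
One way to confirm it really is the firewall and not swish-e itself (just a rough
sketch, assuming LWP::UserAgent is installed; the intranet URL below is only a
placeholder) would be to fetch a single intranet page by hand from the internet
server:

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    # Fetch one intranet page from the machine that runs swishspider.
    # http://intranet.example.mil/ is only a placeholder URL.
    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);

    my $req = HTTP::Request->new( GET => 'http://intranet.example.mil/' );
    my $res = $ua->request($req);

    # A 500 or a timeout here, with a 200 from a machine inside the
    # firewall, points at the firewall rather than at swish-e.
    print $res->status_line, "\n";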

I'll just have to get swish-e installed on our intranet server.
Please let me know if I'm wrong.

Thanks for all the responses.


-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Thursday, May 02, 2002 9:51 AM
To: Multiple recipients of list
Subject: [SWISH-E] RE: HTTP Crawler


At 09:27 AM 05/02/02 -0700, Hsiao Ketung Contr 61 CS/SCBN wrote:
>User-Agent: *
>Disallow: /somedirectory/
>Disallow: /somedirectory/
>..
>
>What does robots.txt do, and
>what's your suggestion?

Google is your friend.

http://www.robotstxt.org/wc/robots.html
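
If you just want to see how a given robots.txt will affect a crawl, the
WWW::RobotRules module that ships with libwww-perl will tell you (a rough
sketch; the host name, agent string, and test URL are only placeholders):

    use strict;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    # Parse the site's robots.txt and ask whether a URL may be crawled.
    my $rules      = WWW::RobotRules->new('swish-e spider');
    my $robots_url = 'http://intranet.example.mil/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    print $rules->allowed('http://intranet.example.mil/somedirectory/index.html')
        ? "allowed\n"
        : "disallowed by robots.txt\n";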

If you use -S prog with spider.pl, you can tell it to ignore robots.txt.
But I'd suggest you get the -S http method working first before tackling
the -S prog / spider.pl setup with swish.
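
For later reference, the -S prog spider is driven by a SwishSpiderConfig.pl
file; if I remember the option name right, something along these lines is what
skips the robots.txt check (the URL and email below are just placeholders):

    # SwishSpiderConfig.pl -- a sketch only; check the spider.pl docs
    # for the full list of options.
    @servers = (
        {
            base_url           => 'http://intranet.example.mil/',
            email              => 'webmaster@example.mil',
            ignore_robots_file => 1,    # don't fetch or obey robots.txt
        },
    );
    1;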


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu May 2 18:24:10 2002