RE: HTTP Crawler

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Thu May 02 2002 - 18:24:25 GMT

On Thu, 2002-05-02 at 11:27, Hsiao Ketung Contr 61 CS/SCBN wrote:
> This is interesting.
> There is http://my-intranet-server-name/robots.txt, and
> the timestamp of robots.txt is June 1999, before I took this job.
> I'll have to see what it does and if I can temporarily remove/rename it
> and try to run swishspider again.

You could add a line, or several lines, allowing/disallowing SWISH-E
access to specific URLs.  As Bill suggested, the robotstxt.org site
should be rather helpful in explaining it.

> The content of it is:
> 
> User-Agent: *
> Disallow: /somedirectory/
> Disallow: /somedirectory/
> ..

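Yep, that's probably the problem.  The "User-Agent: *" record applies to
every robot, SwishSpider included, so the spider stays out of those
directories.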

The current spider's User Agent is:
   SwishSpider http://swish-e.org

You can probably add these two lines to the top of your robots.txt:
  User-Agent: SwishSpider
  Disallow: 

That will allow SwishSpider access to everything but still block other
bots.  You might need to use "SwishSpider*" but probably not.
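
If you want to sanity-check the edited file before re-running the spider,
here is a minimal sketch using Python's standard urllib.robotparser module.
This is not SWISH-E's own robots.txt handling, so treat it only as a rough
check of the rules.  The page.html URL and the "SomeOtherBot" name are made
up for illustration, and I've put a blank line between the two records,
which the robots.txt convention expects.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of robots.txt after adding the SwishSpider record on top.
ROBOTS_TXT = """\
User-Agent: SwishSpider
Disallow:

User-Agent: *
Disallow: /somedirectory/
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical page inside the blocked directory.
    url = "http://my-intranet-server-name/somedirectory/page.html"
    for agent in ("SwishSpider", "SomeOtherBot"):
        print(agent, allowed(agent, url))
```

With these rules the check should report SwishSpider as allowed and the
other bot as blocked for anything under /somedirectory/.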

-- 
 David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Augury Net - http://augur.homeip.net/
  ICQ - 412039
Received on Thu May 2 18:24:27 2002