RE: HTTP Crawler

From: David L Norris <dave(at)>
Date: Thu May 02 2002 - 18:24:25 GMT
On Thu, 2002-05-02 at 11:27, Hsiao Ketung Contr 61 CS/SCBN wrote:
> This is interesting.
> There is http://my-intranet-server-name/robots.txt and
> the time stamp of robots.txt is June 1999, before I took this job.
> I'll have to see what it does and if I can temporarily remove/rename it
> and try to run swishspider again.

You could add a line, or several lines, allowing/disallowing SWISH-E
access to specific URLs.  As Bill suggested, the site
should be rather helpful in explaining it.

> The content of it is:
> User-Agent: *
> Disallow: /somedirectory/
> Disallow: /somedirectory/
> ..

Yep, that's probably the problem.  

The current spider's User Agent is:

You can probably add these two lines to the top of your robots.txt:
  User-Agent: SwishSpider
  Disallow:

That will allow SwishSpider access to everything but still block other
bots.  You might need to use "SwishSpider*" but probably not.
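
If you want to check the effect before touching the live file, here is a rough sketch using Python's standard urllib.robotparser. It assumes the two lines are a "User-Agent: SwishSpider" record followed by an empty "Disallow:" (an empty Disallow value permits everything), placed above the existing catch-all record. The agent name "OtherBot" and the page paths are made up for illustration, and this only shows how Python's parser reads the file; the spider's own robots.txt handling may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: a SwishSpider-specific record with an empty Disallow
# (allow everything), followed by the existing catch-all record.
ROBOTS_TXT = """\
User-Agent: SwishSpider
Disallow:

User-Agent: *
Disallow: /somedirectory/
"""


def can_fetch(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "OtherBot" and the paths below are hypothetical examples.
    for agent in ("SwishSpider", "OtherBot"):
        for path in ("/index.html", "/somedirectory/page.html"):
            print(agent, path, can_fetch(agent, path))
```

The blank line between the two records matters: it ends the SwishSpider record so the "User-Agent: *" rules are not merged into it.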

 David Norris
  Dave's Web -
  Augury Net -
  ICQ - 412039
Received on Thu May 2 18:24:27 2002