
Re: Does the <!-- Swishcommand noindex --> work whe

From: Bill Moseley <moseley@hank.org>
Date: Wed Jun 25 2003 - 15:06:44 GMT
On Wed, Jun 25, 2003 at 09:47:39AM -0500, Cleveland@mail.winnefox.org wrote:
> Hello,
> 
> > # don't allow spidering of these specific files:
> > Disallow: /otherpdfdir/pdfdoc.pdf
> > Disallow: /otherpdfdir/pdfdoc2.pdf
> 
> I created a spider.txt file:

You mean robots.txt, of course.

> User-agent: *
> Disallow: /citydirs/1857/1857full.pdf
> 
> and put it at http://www.oshkoshpubliclibrary.org/robots.txt
> 
> So, then I ran the spider again and noticed this:
> 
> +Fetched 2 Cnt: 22 http://www.oshkoshpubliclibrary.org/citydirs/1857/1857full.pdf 200 OK application/pdf 5781897 parent:http://www.oshkoshpubliclibrary.org/citydirs/browse.html
> 
> Do I have something wrong in robots.txt?
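
First thing to check is that the file is actually being served from
the document root. A quick sketch using LWP::Simple from libwww-perl
(the same LWP toolkit spider.pl uses for fetching); the URL is the one
from your message:

use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch the robots.txt exactly as the spider would see it
my $txt = get('http://www.oshkoshpubliclibrary.org/robots.txt');
print defined $txt ? $txt : "robots.txt could not be fetched\n";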

Here's my test:

/var/www is the document root.

moseley@bumby:~/apache$ cat /var/www/robots.txt
User-agent: *
Disallow: /apache/test.pdf


moseley@bumby:~/apache$ cat index.html
<html>
<head>
<title>Title</title>
</head>
<body>
<a href="test.pdf">pdf file</a>
</body>
</html>

Now spider:


moseley@bumby:~/apache$ SPIDER_DEBUG=skipped,url,links /usr/local/lib/swish-e/spider.pl default http://localhost/apache/index.html > /dev/null
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://localhost/apache/index.html --
>> +Fetched 0 Cnt: 1 http://localhost/apache/index.html 200 OK text/html 99 parent:

Extracting links from http://localhost/apache/index.html:

Looking at extracted tag '<a href="test.pdf">'
   href="http://localhost/apache/test.pdf" Added to list of links to follow
>> -Failed 1 Cnt: 2 http://localhost/apache/test.pdf 403 Forbidden by robots.txt Unknown content type ??? parent:http://localhost/apache/index.html
-Skipped 1 http://localhost/apache/test.pdf: 403 Forbidden by robots.txt

Summary for: http://localhost/apache/index.html
Total Bytes: 99  (99.0/sec)
 Total Docs:  1  (1.0/sec)
Unique URLs:  2  (2.0/sec)
 robots.txt:  1  (1.0/sec)
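
By the way, if you want to test which URLs a robots.txt blocks without
running the whole spider, the WWW::RobotRules module (part of
libwww-perl) applies the same matching rules. A minimal sketch using
the rules from my test above; the agent name is just an example:

use strict;
use warnings;
use WWW::RobotRules;

# Same rules as the /var/www/robots.txt shown above
my $robots_txt = <<'END';
User-agent: *
Disallow: /apache/test.pdf
END

my $rules = WWW::RobotRules->new('example-spider/1.0');
$rules->parse('http://localhost/robots.txt', $robots_txt);

# Disallow matches by path prefix
for my $url ('http://localhost/apache/index.html',
             'http://localhost/apache/test.pdf') {
    printf "%s => %s\n", $url,
        $rules->allowed($url) ? 'allowed' : 'blocked';
}

If this reports the URL blocked but the spider still fetches it, look
on the spider side (e.g. a stale copy of robots.txt) rather than at
the file itself.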

-- 
Bill Moseley
moseley@hank.org