On Wed, Jun 25, 2003 at 09:47:39AM -0500, Cleveland@mail.winnefox.org wrote:
> Hello,
>
> > # don't allow spidering of these specific files:
> > Disallow: /otherpdfdir/pdfdoc.pdf
> > Disallow: /otherpdfdir/pdfdoc2.pdf
>
> I created a spider.txt file:

You mean robots.txt, of course.

> User-agent: *
> Disallow: /citydirs/1857/1857full.pdf
>
> and put it at http://www.oshkoshpubliclibrary.org/robots.txt
>
> So, then I ran the spider again and noticed this:
>
> +Fetched 2 Cnt: 22
> http://www.oshkoshpubliclibrary.org/citydirs/1857/1857full.pdf 200 OK
> application/pdf 5781897
> parent:http://www.oshkoshpubliclibrary.org/citydirs/browse.html
>
> Do I have something wrong in robots.txt?

Here's my test:

/var/www is the document root.

moseley@bumby:~/apache$ cat /var/www/robots.txt
User-agent: *
Disallow: /apache/test.pdf

moseley@bumby:~/apache$ cat index.html
<html>
<head>
<title>Title</title>
</head>
<body>
<a href="test.pdf">pdf file</a>
</body>
</html>

Now spider:

moseley@bumby:~/apache$ SPIDER_DEBUG=skipped,url,links /usr/local/lib/swish-e/spider.pl default http://localhost/apache/index.html > /dev/null
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://localhost/apache/index.html --

>> +Fetched 0 Cnt: 1 http://localhost/apache/index.html 200 OK text/html 99 parent:

Extracting links from http://localhost/apache/index.html:
 Looking at extracted tag '<a href="test.pdf">'
   href="http://localhost/apache/test.pdf" Added to list of links to follow

>> -Failed 1 Cnt: 2 http://localhost/apache/test.pdf 403 Forbidden by robots.txt Unknown content type ??? parent:http://localhost/apache/index.html
-Skipped 1 http://localhost/apache/test.pdf: 403 Forbidden by robots.txt

Summary for: http://localhost/apache/index.html
    Total Bytes: 99 (99.0/sec)
     Total Docs: 1 (1.0/sec)
    Unique URLs: 2 (2.0/sec)
     robots.txt: 1 (1.0/sec)

-- 
Bill Moseley
moseley@hank.org

Received on Wed Jun 25 15:06:49 2003
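
A robots.txt rule can also be checked without running the spider at all. What follows is only a rough sketch and is not part of swish-e or spider.pl: it assumes Python's standard urllib.robotparser module, and the helper name is_allowed and the ROBOTS string are made up for illustration, reusing the robots.txt contents and URLs from the test in this message.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of fetching them over HTTP.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Same rules as the /var/www/robots.txt used in the test.
ROBOTS = """\
User-agent: *
Disallow: /apache/test.pdf
"""

if __name__ == "__main__":
    for url in (
        "http://localhost/apache/test.pdf",
        "http://localhost/apache/index.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS, url) else "blocked"
        print(url, verdict)
```

urllib.robotparser matches Disallow paths as simple prefixes, which may differ in detail from what spider.pl does, so treat the result as a sanity check on the rule rather than a prediction of the spider's behaviour.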