Re: Intrusive spiders/crawlers

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Jun 09 2001 - 22:09:41 GMT
At 12:03 PM 06/09/01 -0700, Paul Thomas wrote:
>I have some web-browsable archives on my web server. I use
>a robots.txt file to designate off-limits web directories, and I've
>also tried protecting private archive directories with .htaccess
>files. However, there are still some spiders that just come
>through and gobble everything up anyway.
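
Keep in mind that robots.txt is purely advisory: well-behaved robots honor it, but it is not access control, which is why rude spiders walk right past it. A minimal entry looks like this (the path is a placeholder):

    User-agent: *
    Disallow: /private-archives/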

You have .htaccess configured incorrectly if people are able to spider what
you think are protected directories and files.
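
For comparison, a minimal working .htaccess for Basic authentication looks something like the sketch below (the realm name and paths are placeholders). Also make sure the enclosing <Directory> block in your server config permits it with "AllowOverride AuthConfig", or Apache will ignore the file entirely:

    AuthType Basic
    AuthName "Private Archives"
    AuthUserFile /usr/local/apache/conf/.htpasswd
    Require valid-user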

You can block specific IPs or ranges of IPs, or you can contact the
spider's upstream provider (assuming it isn't hiding behind a proxy).
There are also various throttling modules for Apache that attempt to
detect high load from spiders or misbehaved programs such as Internet
Explorer.  Or you can install more memory and faster disks.
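
For example, with the standard mod_access directives you can deny a single address or a whole network (the addresses below are placeholders; this goes in httpd.conf, or in .htaccess where "AllowOverride Limit" permits it):

    Order allow,deny
    Allow from all
    Deny from 10.1.2.3
    Deny from 10.1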

There's currently a discussion on the mod_perl list about this topic.

Good luck,

Bill Moseley
mailto:moseley@hank.org
Received on Sat Jun 9 22:28:54 2001