Here’s how to set up a centrally managed ban list to protect your server from badly behaved bots and other miscreants. With a simple addition, you can also prevent bad bot visits from even appearing in your Apache access logs.
A separate article explored some basic Apache performance tweaks to help your OS X Server work to its full potential (“Essential Performance Tweaks for Your New OS X Server”). Three others (starting with “Improve OS X Performance by Tweaking Apache Logging, And Options for Rotating Logs”) covered piped logging and log rotation with newsyslog, and another explained how to speed up services for IE users (“Improve OS X Server Performance for IE Browsers”). One other fairly straightforward improvement is to move any global ban lists you might maintain (say, for the purpose of blocking rude bots) into a .conf file rather than an .htaccess file. We can also set it up so that bad bot visits won’t even be mentioned in the main access logs, further reducing the load that unwanted spidering puts on the server.
(Although it’s common to maintain ban lists in an .htaccess file, doing so creates two problems: one for general performance, the other for maintenance. In terms of performance, Apache has to look for and read .htaccess files on every request, while a .conf file is read just once, when Apache starts up or restarts. And in terms of general maintenance, keeping ban lists in .htaccess files means updating each of those files every time you want to make a change, rather than just changing one central configuration file.)
If you’ve used the logging cleanup suggestions from the earlier article, you will have set up logging so that a request is skipped whenever the environment variable ‘donotlog’ has been set. We can take advantage of this by structuring a ban list like so (note that the large Russian Yandex operation cited here is just one example of a ‘bad bot’ from my own logs; naturally, you should base your list on your own experience):
#
# Overall ban list for known junk and nuisances
#

# Yandex not obeying robots.txt exclusion
SetEnvIf Remote_Addr "87\.250\.255\.243" badbot donotlog
SetEnvIf Remote_Addr "199\.21\.99\.[0-9]+" badbot donotlog
SetEnvIf Remote_Addr "5\.45\.202\.218" badbot donotlog

# Now deny access to bad bots
<Files "*">
    Order deny,allow
    Deny from env=badbot
</Files>

# Allow everyone to see a 403 error
<Files "403.shtml">
    Order deny,allow
    Allow from all
</Files>

# And allow everyone to see the robots.txt
<Files "robots.txt">
    Order deny,allow
    Allow from all
</Files>
What’s happening here is that whenever the visitor’s IP matches the regex shown, we’re setting both the environment variable ‘badbot’ and the environment variable ‘donotlog’. Then we deny access to anything if the ‘badbot’ environment variable is set, but we enable access if the request is for the 403 error page. (We can also enable access to the robots.txt file, but the reason I place bad bots on a ban list in the first place is that they have ignored the robots.txt, so I don’t see much benefit in doing this.) Because we’ve also set the ‘donotlog’ variable, these visits are skipped in the access log and we’re not bothered with them. Using this method, the visits will still be noted in the error log, however.
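For reference, skipping log entries this way relies on a conditional CustomLog directive. The following is a minimal sketch of the standard Apache form, not a copy of the earlier article's setup; the log path and the 'combined' format nickname are assumptions:

```apache
# Assumed log path and format nickname; adjust to match your own setup.
# The env=!donotlog clause tells Apache to write the entry only when
# the 'donotlog' variable is NOT set for the request.
CustomLog "/var/log/apache2/access_log" combined env=!donotlog
```

With a line like this in place, requests tagged by the SetEnvIf lines in the ban list are left out of the access log.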
As an alternative to IP-based banning, which you can also achieve with afctl (see the separate article on the adaptive firewall: “Mac OS X Packet Filter and Adaptive Firewall”), you can match on other request attributes, such as the user agent. So, for example, a general ban of the TurnItIn bot could look like this:
SetEnvIfNoCase User-Agent "turnitinbot.*" badbot donotlog
In addition, you can set the relevant environment variables via a RewriteRule, as shown in this example (which does not function as written, because I haven’t filled in anything for the RewriteCond):
# whatever your conditions might be...
RewriteCond ...
RewriteRule .* - [E=badbot:yes,E=donotlog:yes]
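To make the idea concrete, here is one way the placeholder condition might be filled in. This is only a sketch: the user-agent string 'examplebot' is hypothetical, and it assumes the rules sit in the server-level config (not in an .htaccess file or a <Directory> block), where mod_rewrite runs before access checks, so that the Deny from env=badbot test can see the variable.

```apache
# Sketch only: 'examplebot' is a made-up user-agent string.
RewriteEngine On
# Match the User-Agent header case-insensitively ([NC]).
RewriteCond %{HTTP_USER_AGENT} examplebot [NC]
# '-' means no substitution; the E flags just set the variables.
RewriteRule .* - [E=badbot:yes,E=donotlog:yes]
```

Bear in mind that virtual hosts do not inherit rewrite rules from the main server config unless RewriteOptions Inherit is set, so for simple matches the SetEnvIf approach is probably the easier route.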
Once you’ve put together a ban list like this, it’s a simple matter to drop it into a .conf file placed in .../apache2/extra and then add an extra line to the main httpd_server_app.conf to include it, as described in the earlier article about performance tweaks (“Essential Performance Tweaks for Your New OS X Server”).
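The added line would be an Include directive along the following lines. The file name httpd-banlist.conf is a hypothetical placeholder, and the leading part of the path is left elided just as it is in the text, so substitute the real location on your own server:

```apache
# Hypothetical file name; the '...' stands for the elided path prefix
# and must be replaced with the actual directory on your system.
Include .../apache2/extra/httpd-banlist.conf
```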
Any time you modify the file, all it takes is an Apache restart to re-load the information:
sudo apachectl graceful
As noted in that earlier article, any time you mess with an Apple-supplied config file, you’re obliged to check after each software update whether the default config file has changed, and to copy over any changes as needed.
All material on this site is carefully reviewed, but its accuracy cannot be guaranteed, and some suggestions offered here might just be silly ideas. For best results, please do your own checking and verifying. This specific article was last reviewed or updated by Greg.