Block automated scanners from scanning a website

Disclaimer#

This post describes how to block automated scanners from scanning a website.

Requirements#

You will need:

  • fail2ban
  • nginx (another web server can work too, but this will need some modifications)
  • the ability to create some files in the website directory

I tested it with:

  • nginx 1.15.1
  • fail2ban 0.10.3.fix1

How does it work?#

robots.txt is normally used to give instructions to web robots (bots) about what to index or what must not be accessed.

Legitimate bots from search engines won't index disallowed pages, but bad bots will, on the contrary, attempt to access and interact with those pages.

So we will block every IP address that interacts with a fake page.

This method should work against most web security scanners because they typically parse robots.txt to find interesting paths such as wp-admin.
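To get an idea of how little effort that takes, here is a minimal sketch of how the disallowed paths can be extracted from robots.txt (example.com is only a placeholder for the target site):

$ curl -s https://example.com/robots.txt | grep -i '^disallow:'
Disallow: /wp-admin/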

Steps to follow#

First we have to create a robots.txt file at the root directory of the web server we want to protect, or append the following lines if it already exists.

User-agent: *
Disallow: /pathNeverUsedInYourApp/

For my example I will use this one:

User-agent: *
Disallow: /26d20db82aa40be369216bc01db23e6f8b1048d9/

The robots are instructed not to access the directory /26d20db82aa40be369216bc01db23e6f8b1048d9/.
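The directory name itself has no meaning; it just needs to be a path that nothing legitimate on the site ever uses. One way to generate such a random-looking token (assuming openssl is available) is:

$ openssl rand -hex 20

which prints a 40-character hexadecimal string similar to the one used above.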

But web security scanners parse and crawl all the URLs from robots.txt in order to find administrative interfaces or high-value resources from a security point of view.

Normal visitors of the website would not know about this directory since it is not linked anywhere on the site. Search engines would not visit it either since it is disallowed in robots.txt. Only web security scanners or hackers would find it.

In /26d20db82aa40be369216bc01db23e6f8b1048d9/ we will create an HTML file containing a hidden form, and this form will post to a fake page.

The form is hidden with CSS, so human visitors won't see it, but web security scanners will, since they ignore stylesheets.

So create /26d20db82aa40be369216bc01db23e6f8b1048d9/index.html containing something like this:

<!DOCTYPE html>
<html>
  <head>
    <style>
      .obfu {
        display: none;
      }
    </style>
  </head>
  <body>
    <!-- honeypot form, hidden from human visitors by the CSS class above -->
    <div class='obfu'>
      <p>Secret area</p>
      <form method='POST' action='hidden.php'>
        <input type='text' name='subject' value='154'>
        <input type='submit' value='submit'>
      </form>
    </div>
  </body>
</html>

Note: making the form invisible is not mandatory; it is only there to keep humans from stumbling onto it.
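Once the file is in place you can check that it is served correctly; requesting the directory index itself is harmless since only hidden.php will be penalized later (again, example.com stands in for your site):

$ curl -s https://example.com/26d20db82aa40be369216bc01db23e6f8b1048d9/ | grep -c 'hidden.php'
1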

The web scanner will submit this form and start testing the form inputs with various payloads looking for vulnerabilities.

If someone visits hidden.php, it must be a hacker or a web security scanner. So we will log the attempt and block the IP address.

Question: why block only the bots submitting the form and not everyone hitting /26d20db82aa40be369216bc01db23e6f8b1048d9/? Because banning on the directory itself would also catch people or tools that are merely curious, not malicious.

Before blocking anything we need to log the attempts. For the sake of simplicity we will log them in a separate file.

We need to edit the nginx web server configuration, e.g. /etc/nginx/sites-available/website.conf, and add a location block inside the server block:

location ~ /26d20db82aa40be369216bc01db23e6f8b1048d9/hidden\.php {
    # log every hit on the trap URL to a dedicated file for fail2ban
    access_log /var/log/nginx/ban.log;
}

With this config, all requests hitting /26d20db82aa40be369216bc01db23e6f8b1048d9/hidden.php will be logged into /var/log/nginx/ban.log. Note that hidden.php does not need to exist: nginx will simply answer with a 404, and the request is still logged.
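Assuming nginx's default combined log format (adjust if your server block defines a custom log_format), a trapped request would look roughly like this, with the offending IP address at the beginning of the line:

192.168.0.20 - - [09/Jul/2018:14:21:37 +0200] "POST /26d20db82aa40be369216bc01db23e6f8b1048d9/hidden.php HTTP/1.1" 404 152 "-" "Mozilla/5.0 (compatible; ExampleScanner)"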

Now we will use Fail2Ban to automatically block the IP address. Fail2Ban will scan the log file and ban the malicious IP addresses.

The Fail2Ban configuration is as follows (to add to the existing configuration):

/etc/fail2ban/jail.d/jail.conf

# block bad bots not respecting robots.txt
[robots-txt]
enabled = true
filter = robots-txt
action = iptables-multiport[name=HTTP, port="http,https", protocol=tcp]
logpath = /var/log/nginx/ban.log
# 18000 seconds = 5 hours; maxretry = 1 bans on the first offending request
bantime = 18000
maxretry = 1

/etc/fail2ban/filter.d/robots-txt.conf

[Definition]

# every line in ban.log is an offending request, so match any line and capture the client IP
failregex = ^<HOST> -.*

ignoreregex =
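Before relying on the jail, you can verify that the filter matches the logged attempts with the fail2ban-regex tool:

$ sudo fail2ban-regex /var/log/nginx/ban.log /etc/fail2ban/filter.d/robots-txt.conf

It reports how many log lines matched the failregex.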

Note: it is possible to do the same thing with /var/log/nginx/access.log and a more complex failregex, in order to avoid 1) having a separate log file with duplicate entries and 2) modifying the nginx configuration.
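As a rough sketch of that alternative (not tested here, and assuming the default combined log format in access.log), the filter would match only requests to the trap URL:

failregex = ^<HOST> -.*"(GET|POST) /26d20db82aa40be369216bc01db23e6f8b1048d9/hidden\.php

The jail's logpath would then point to /var/log/nginx/access.log and the extra nginx location block would no longer be needed.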

Of course, to make the changes effective you need to reload nginx (sudo systemctl reload nginx.service) and fail2ban (sudo systemctl reload fail2ban.service).

Note: before testing this yourself, whitelist your IP address or use a proxy or a VPN to avoid getting blocked. To unblock an IP address use sudo fail2ban-client set robots-txt unbanip 192.168.0.20.
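To simulate a scanner hitting the trap (from a non-whitelisted address such as a VPN or proxy, as noted above; example.com stands in for your site):

$ curl -s -o /dev/null -d 'subject=154' https://example.com/26d20db82aa40be369216bc01db23e6f8b1048d9/hidden.php

A single request should be enough to get banned since maxretry is set to 1.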

Finally, you can check whether any IP addresses have been blocked:

$ sudo fail2ban-client status robots-txt
Status for the jail: robots-txt
|- Filter
|  |- Currently failed: 0
|  |- Total failed: 11
|  `- File list: /var/log/nginx/ban.log
`- Actions
   |- Currently banned: 1
   |- Total banned: 11
   `- Banned IP list: 192.168.0.20

Source#

I was highly inspired by this article: How to Block Automated Scanners from Scanning your Site, posted on July 9, 2014 by Bogdan Calin on the Acunetix blog.
