Control: tags -1 +confirmed
Control: found -1 0.9.6-2
Control: fixed -1 0.10.2-2~bpo9+1

[hmm... sorry for the duplicate email, not sure how i managed to send
the last one, but here's the complete version.]

TL;DR: this is probably fixed in buster, try the version in
stretch-backports.

I can confirm both bugs here. During the last upgrade, needrestart
noticed fail2ban needed a restart, so it did. Here's what systemd sees
right now:

root@marcos:/home/anarcat# systemctl status fail2ban
● fail2ban.service - Fail2Ban Service
   Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; vendor 
preset: enabled)
   Active: active (running) since Sun 2019-02-17 15:32:58 EST; 9min ago
     Docs: man:fail2ban(1)
  Process: 5829 ExecStop=/usr/bin/fail2ban-client stop (code=exited, status=255)
  Process: 12738 ExecStart=/usr/bin/fail2ban-client -x start (code=exited, 
status=0/SUCCESS)
 Main PID: 12745 (fail2ban-server)
    Tasks: 14 (limit: 4915)
   Memory: 33.3M
      CPU: 3min 14.842s
   CGroup: /system.slice/fail2ban.service
           └─12745 /usr/bin/python3 /usr/bin/fail2ban-server -s 
/var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ba

fév 17 15:42:06 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.80
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.82
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.84
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.85
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.86
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.87
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.88
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.90
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.91
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.93
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.94
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.95
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.178.97

... and those lines are still being added there:

fév 17 15:43:28 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
60.168.182.250

etc...

In other words, fail2ban has been loading a list of IP addresses in the
firewall for the last 9 minutes. Before that, it was *removing* IP
addresses from the firewall for another three minutes (before systemd
gave up, maybe related to #788805):

root@marcos:/home/anarcat# journalctl -u fail2ban.service | grep systemd
fév 17 15:29:57 marcos systemd[1]: Stopping Fail2Ban Service...
fév 17 15:31:27 marcos systemd[1]: fail2ban.service: Stopping timed out. 
Terminating.
fév 17 15:31:27 marcos systemd[1]: fail2ban.service: Control process exited, 
code=exited status=255
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: State 'stop-sigterm' timed 
out. Killing.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Killing process 2176 
(fail2ban-server) with signal SIGKILL.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Main process exited, 
code=killed, status=9/KILL
fév 17 15:32:57 marcos systemd[1]: Stopped Fail2Ban Service.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Unit entered failed state.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Failed with result 
'timeout'.
fév 17 15:32:57 marcos systemd[1]: Starting Fail2Ban Service...
fév 17 15:32:58 marcos systemd[1]: Started Fail2Ban Service.

Now it systemd considers the thing "started" even though it's still
loading the IP list. Because systemd timed out, we also see this in the
logs:

fév 17 15:45:39 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] 
183.160.227.26 already banned 

... obviously, fail2ban didn't have time to remove all addresses from
the firewall. It ended up taking 15 minutes loading the entire IP set
back into memory, between the first IP loaded and the last:

fév 17 15:32:58 marcos systemd[1]: Started Fail2Ban Service.
[...]
fév 17 15:32:58 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban 
103.207.37.40
[...]
fév 17 15:47:13 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] 
60.168.182.26 already banned

There are many things wrong here:

 1. "service fail2ban restart" should try to remove all IP addresses
    from the firewall if it's going to re-add them all a second later

 2. even if it *does* decide to do that, it shouldn't fail halfway
    through.

 3. if "stop" waits for all IPs to be cleared out, "start" should as
    well

 4. loading IP addresses shouldn't be that slow

We could try to fix points 1-3, but I think the problem really lies with
point 4. Why is fail2ban so damn slow at loading those IP sets?

I don't have that many addresses loaded in that jail:

    Status for the jail: postfix-auth
    |- Filter
    |  |- Currently failed: 4
    |  |- Total failed:     456
    |  `- File list:        /var/log/mail.log
    `- Actions
       |- Currently banned: 4499
       |- Total banned:     4499 
       `- Banned IP list:   [...]

A 5000 IP block list is really not that much. Note that I'm using the
`ipset` functionality in my jails to improve on that process (it was
even slower before):

    # head -5 /etc/fail2ban/jail.local 
    [DEFAULT]
    findtime = 60
    banaction = iptables-ipset-proto6
    banaction_allports = iptables-ipset-proto6-allports

This should be fast enough - ipset is *designed* to handle large sets of
IP addresses quickly. If my math is right, fail2ban only adds 5 IPs per
second to ipset, or a latency of 190ms. Yet ipset itself can run much,
much faster:

    already-present# time ipset add f2b-postfix-auth 113.117.65.122
    ipset v6.30: Element cannot be added to the set: it's already added
    
    real    0m0,003s
    user    0m0,000s
    sys     0m0,000s
    not-present# time ipset add f2b-postfix-auth 113.117.65.123
    
    real    0m0,002s
    user    0m0,000s
    sys     0m0,000s

At that speed (2ms / IP) Adding 4500 IPs with ipset should take only 9
seconds (4499 * 0.002 s), not 15 minutes. So there's something really
wrong in the way fail2ban talks to ipset, there's probably a hot loop
somewhere that's totally screwing things up. It might be possible to
speed up things by adding all entries in the ipset at once, for example.

0.10.3 has seen "several stability and performance optimizations", maybe
it would be worth updating there (#922539).

I have tested the backport (0.10.2-2~bpo9+1) and it already loads the
list much, much faster. After upgrading, here's the service state:

   Active: active (running) since Sun 2019-02-17 16:15:26 EST; 1min 22s ago
[...]
fév 17 16:16:29 marcos fail2ban-server[22816]: fail2ban.actions        [22816]: 
NOTICE  [postfix-auth] Ban 183.160.224.166

The latter is the last loaded IP, so it took a little over a minute to
load the IP set. Restarting is also much faster: it can stop the service
in three seconds. Reloading the ipset does take a minute (still 6 times
what it should really be taking) but it's already an improvement enough
that I'm daring to mark this as solved in buster (through backports,
naturally).

I hope that helps,

A.

A.

-- 
On ne résout pas un problème avec les modes de pensée qui l'ont
engendré.
                        - Albert Einstein

Reply via email to