Control: tags -1 +confirmed
Control: found -1 0.9.6-2
Control: fixed -1 0.10.2-2~bpo9+1
[hmm... sorry for the duplicate email, not sure how i managed to send
the last one, but here's the complete version.]
TL;DR: this is probably fixed in buster, try the version in
stretch-backports.
I can confirm both bugs here. During the last upgrade, needrestart
noticed fail2ban needed a restart, so it did. Here's what systemd sees
right now:
root@marcos:/home/anarcat# systemctl status fail2ban
● fail2ban.service - Fail2Ban Service
Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; vendor
preset: enabled)
Active: active (running) since Sun 2019-02-17 15:32:58 EST; 9min ago
Docs: man:fail2ban(1)
Process: 5829 ExecStop=/usr/bin/fail2ban-client stop (code=exited, status=255)
Process: 12738 ExecStart=/usr/bin/fail2ban-client -x start (code=exited,
status=0/SUCCESS)
Main PID: 12745 (fail2ban-server)
Tasks: 14 (limit: 4915)
Memory: 33.3M
CPU: 3min 14.842s
CGroup: /system.slice/fail2ban.service
└─12745 /usr/bin/python3 /usr/bin/fail2ban-server -s
/var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ba
fév 17 15:42:06 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.80
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.82
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.84
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.85
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.86
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.87
fév 17 15:42:07 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.88
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.90
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.91
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.93
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.94
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.95
fév 17 15:42:08 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.178.97
... and those lines are still being added there:
fév 17 15:43:28 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
60.168.182.250
etc...
In other words, fail2ban has been loading a list of IP addresses in the
firewall for the last 9 minutes. Before that, it was *removing* IP
addresses from the firewall for another three minutes (before systemd
gave up, maybe related to #788805):
root@marcos:/home/anarcat# journalctl -u fail2ban.service | grep systemd
fév 17 15:29:57 marcos systemd[1]: Stopping Fail2Ban Service...
fév 17 15:31:27 marcos systemd[1]: fail2ban.service: Stopping timed out.
Terminating.
fév 17 15:31:27 marcos systemd[1]: fail2ban.service: Control process exited,
code=exited status=255
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: State 'stop-sigterm' timed
out. Killing.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Killing process 2176
(fail2ban-server) with signal SIGKILL.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Main process exited,
code=killed, status=9/KILL
fév 17 15:32:57 marcos systemd[1]: Stopped Fail2Ban Service.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Unit entered failed state.
fév 17 15:32:57 marcos systemd[1]: fail2ban.service: Failed with result
'timeout'.
fév 17 15:32:57 marcos systemd[1]: Starting Fail2Ban Service...
fév 17 15:32:58 marcos systemd[1]: Started Fail2Ban Service.
Now it systemd considers the thing "started" even though it's still
loading the IP list. Because systemd timed out, we also see this in the
logs:
fév 17 15:45:39 marcos fail2ban.actions[12745]: NOTICE [postfix-auth]
183.160.227.26 already banned
... obviously, fail2ban didn't have time to remove all addresses from
the firewall. It ended up taking 15 minutes loading the entire IP set
back into memory, between the first IP loaded and the last:
fév 17 15:32:58 marcos systemd[1]: Started Fail2Ban Service.
[...]
fév 17 15:32:58 marcos fail2ban.actions[12745]: NOTICE [postfix-auth] Ban
103.207.37.40
[...]
fév 17 15:47:13 marcos fail2ban.actions[12745]: NOTICE [postfix-auth]
60.168.182.26 already banned
There are many things wrong here:
1. "service fail2ban restart" should try to remove all IP addresses
from the firewall if it's going to re-add them all a second later
2. even if it *does* decide to do that, it shouldn't fail halfway
through.
3. if "stop" waits for all IPs to be cleared out, "start" should as
well
4. loading IP addresses shouldn't be that slow
We could try to fix points 1-3, but I think the problem really lies with
point 4. Why is fail2ban so damn slow at loading those IP sets?
I don't have that many addresses loaded in that jail:
Status for the jail: postfix-auth
|- Filter
| |- Currently failed: 4
| |- Total failed: 456
| `- File list: /var/log/mail.log
`- Actions
|- Currently banned: 4499
|- Total banned: 4499
`- Banned IP list: [...]
A 5000 IP block list is really not that much. Note that I'm using the
`ipset` functionality in my jails to improve on that process (it was
even slower before):
# head -5 /etc/fail2ban/jail.local
[DEFAULT]
findtime = 60
banaction = iptables-ipset-proto6
banaction_allports = iptables-ipset-proto6-allports
This should be fast enough - ipset is *designed* to handle large sets of
IP addresses quickly. If my math is right, fail2ban only adds 5 IPs per
second to ipset, or a latency of 190ms. Yet ipset itself can run much,
much faster:
already-present# time ipset add f2b-postfix-auth 113.117.65.122
ipset v6.30: Element cannot be added to the set: it's already added
real 0m0,003s
user 0m0,000s
sys 0m0,000s
not-present# time ipset add f2b-postfix-auth 113.117.65.123
real 0m0,002s
user 0m0,000s
sys 0m0,000s
At that speed (2ms / IP) Adding 4500 IPs with ipset should take only 9
seconds (4499 * 0.002 s), not 15 minutes. So there's something really
wrong in the way fail2ban talks to ipset, there's probably a hot loop
somewhere that's totally screwing things up. It might be possible to
speed up things by adding all entries in the ipset at once, for example.
0.10.3 has seen "several stability and performance optimizations", maybe
it would be worth updating there (#922539).
I have tested the backport (0.10.2-2~bpo9+1) and it already loads the
list much, much faster. After upgrading, here's the service state:
Active: active (running) since Sun 2019-02-17 16:15:26 EST; 1min 22s ago
[...]
fév 17 16:16:29 marcos fail2ban-server[22816]: fail2ban.actions [22816]:
NOTICE [postfix-auth] Ban 183.160.224.166
The latter is the last loaded IP, so it took a little over a minute to
load the IP set. Restarting is also much faster: it can stop the service
in three seconds. Reloading the ipset does take a minute (still 6 times
what it should really be taking) but it's already an improvement enough
that I'm daring to mark this as solved in buster (through backports,
naturally).
I hope that helps,
A.
A.
--
On ne résout pas un problème avec les modes de pensée qui l'ont
engendré.
- Albert Einstein