Hi,

I ran into this problem today. An automated upgrade via unattended-upgrades upgraded debianutils, after which needrestart decided that anacron.service needed to be restarted.
However, backup cron jobs were running at the time, and these take longer than a minute or two. After the brief stop timeout, every process in the chain was killed abruptly, without any of the actual cron job tasks receiving a SIGTERM that would let them clean up. Worst of all, the process does not even get a chance to report an error or failure, so no mail ends up in the mailbox.

The result (with borgbackup at least) is a lot of manual work: cleaning up stale repository locks, checking some caches, and manually unmounting and removing the snapshots made for backup purposes.

I strongly feel that stopping or restarting anacron.service should never, ever time out. A very long-running (possibly stuck) cron job should result in a blocking (or failing) stop action, which can then be investigated properly by the administrator. Such an event would be a bug in another package and not a problem with the anacron daemon.

Forcefully killing long-running cron jobs can have severe consequences. In today's case it was recoverable, but similar cron jobs could also perform automated cleanup or pruning tasks in databases, registries, etc., where killing is very much undesired and effectively as bad as a system crash for data integrity purposes.

I can think of two ways to improve this.

1. Always let jobs finish cleanly: TimeoutStopSec=infinity
   I strongly prefer this option in all cases (desktop/server/...).

2. Send SIGUSR1 to anacron as is the case now, then on timeout send SIGTERM to all processes in the group, then on a second timeout send SIGKILL to all processes in the group. I must admit I don't know how to implement this with systemd services.

Could you share your thoughts regarding this issue?

Thanks,

-- 
Melvin Vermeeren
Systems engineer
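
P.S. For reference, option 1 could probably be tried locally with a drop-in along the following lines. This is only a sketch: the file name is the default that `systemctl edit anacron.service` creates, and I am assuming that nothing else in the packaged unit or in other drop-ins overrides the value.

```ini
# /etc/systemd/system/anacron.service.d/override.conf
# (assumed location; created by "systemctl edit anacron.service")

[Service]
# "infinity" disables the stop timeout, so systemd waits for anacron
# and its running jobs to exit instead of sending SIGKILL.
TimeoutStopSec=infinity
```

If the file is written by hand rather than through `systemctl edit`, a `systemctl daemon-reload` is needed before the setting takes effect. The obvious trade-off is that a truly stuck job would then block the stop or restart indefinitely, which is the behaviour I am arguing for above.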