Hi,

I ran into this problem today. Automated upgrades with unattended-upgrades 
upgraded debianutils. Then needrestart decided that anacron.service needed to 
be restarted.

However, backup cron jobs were running at the time, which take longer than a 
minute or two. After the brief timeout every single process in the chain got 
killed abruptly, without any of the actual cron job tasks even receiving a 
SIGTERM to clean up nicely. Worst of all, this means the process does not even
get a chance to report an error or failure, so no mail ends up in the mailbox.

The result (with borgbackup at least) is a lot of manual work: cleaning up
stale repository locks, checking some caches, and manually unmounting and
removing the snapshots made for backup purposes.

I strongly feel that stopping/restarting anacron.service should never, ever
time out at all. A very long-running (possibly stuck) cron job should result in
a blocking (or failing) stop action which can then be investigated properly by
the administrator. Such an event would be a bug in another package and not a
problem with the anacron daemon.

Forcefully killing long-running cron jobs can have severe consequences. In 
today's case it was recoverable but similar cron jobs could also perform 
automated cleanup/pruning tasks in databases, registries, etc., where killing
is very, very much undesired and effectively as bad as a system crash for data
integrity purposes.

I can think of two ways to improve this.

1. Always let jobs finish cleanly: TimeoutStopSec=infinity
I strongly prefer this option in all cases (desktop/server/...).

2. Send SIGUSR1 to anacron as is the case now, then on timeout send SIGTERM to
all processes in the group, then on a second timeout send SIGKILL to all
processes in the group. I must admit I don't know how to implement this with
systemd services.
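
To illustrate option 1, here is a minimal sketch of a local drop-in. It
assumes the unit is named anacron.service and that a local override is
acceptable instead of changing the packaged unit file; the drop-in path is the
one "systemctl edit anacron.service" would normally create. I have not tested
how this interacts with the other settings in the shipped unit.

```ini
# /etc/systemd/system/anacron.service.d/override.conf
# Sketch only: disables the stop timeout so systemd waits for running
# jobs to finish instead of escalating to SIGKILL.
[Service]
TimeoutStopSec=infinity
```

If the file is created by hand, "systemctl daemon-reload" is needed
afterwards. With this in place a stop or restart blocks until the jobs exit,
so a stuck job shows up as a hanging stop action rather than a silent kill.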

Could you share thoughts regarding this issue?

Thanks,

-- 
Melvin Vermeeren
Systems engineer
