Hello,

I am having an odd problem where users are unable to kill their jobs with scancel. Users can submit jobs just fine, and when a task completes on its own the job closes out correctly. However, if a user attempts to cancel a job via scancel, the SIGKILL signals are sent to the step but never take effect. Slurmd then keeps sending SIGKILL requests until the UnkillableStepTimeout is hit, the Slurm job exits with an error, the node enters a draining state, and the spawned processes continue to run on the node.
I'm at a loss because jobs can complete without issue, which seems to suggest it isn't a networking or permissions problem preventing Slurm from carrying out its job accounting tasks. A user can ssh to the node once a job is submitted and kill the subprocesses manually, at which point Slurm completes the epilog and the node returns to idle.
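For reference, this is roughly what that manual cleanup looks like on the compute node; the job ID and PID below are placeholders, not real values:

    # List the PIDs Slurm still associates with the stuck job (run on the compute node)
    scontrol listpids 12345

    # Inspect the leftover processes, then kill them by hand
    ps -o pid,stat,wchan:32,cmd -p 67890
    kill -9 67890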
Does anyone know what may be causing this behavior? Please let me know which slurm.conf or cgroup.conf settings would be helpful for diagnosing the issue. I'm quite stumped by this one.
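In case it helps, these are the settings I assume are most relevant to process tracking and step termination; the values below are illustrative placeholders rather than our actual configuration:

    # slurm.conf (relevant excerpts, placeholder values)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    KillWait=30
    UnkillableStepTimeout=120
    #UnkillableStepProgram=   (not set)

    # cgroup.conf (placeholder values)
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=no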
--
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593