Hello,

I am having an odd problem where users are unable to kill their jobs with scancel. Users can submit jobs just fine, and when a task completes on its own it exits cleanly. However, if a user attempts to cancel a job via scancel, the SIGKILL signals are sent to the step but never take effect. Slurmd then keeps resending SIGKILL until UnkillableStepTimeout is hit, the Slurm job exits with an error, the node enters a draining state, and the spawned processes continue to run on the node.

I'm at a loss because jobs can complete without issue, which seems to suggest it's not a networking or permissions problem with Slurm performing its job accounting tasks. A user can ssh to the node once a job has been submitted and kill the subprocesses manually, at which point Slurm completes the epilog and the node returns to idle.
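
For context, the manual cleanup that works looks roughly like this (the node name and username below are just placeholders, not our actual hosts):

    # on the affected compute node, as root or the job owner
    ssh node001
    pkill -KILL -u jobuser    # kill the leftover job processes by owner
    # Slurm then finishes the epilog and the node returns to idle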

Does anyone know what might be causing this behavior? Please let me know which slurm.conf or cgroup.conf settings would be helpful for diagnosing the issue. I'm quite stumped by this one.
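
For reference, these are the parameters I'm assuming people will want to see; the values below are only illustrative defaults, not necessarily what we're running, and I'm happy to post our actual config:

    # slurm.conf (process tracking / kill handling)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup
    KillWait=30
    UnkillableStepTimeout=60

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

The live values can also be pulled from the running daemon with:

    scontrol show config | grep -iE 'proctrack|taskplugin|killwait|unkillable'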

--

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (858) 246-5593
