Hi,

Check the epilog's return value, which is the return value of the last command in the epilog script. For testing purposes, you can also add an "exit 0" line as the last line of the epilog script to guarantee a zero return value.
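
To illustrate the point, here is a minimal sketch in plain shell, not tied to Slurm itself. The function failing_step is a made-up stand-in for whatever command happens to be last in your epilog, and the subshells stand in for the epilog script as a whole.

```shell
#!/bin/bash
# Sketch only: shows that a script's exit status is the status of its
# last command unless an explicit "exit 0" overrides it.

failing_step() {
    # Hypothetical stand-in for an epilog command that fails.
    return 3
}

# Without an explicit exit, the subshell returns the status of its last command.
( failing_step )
rc_without=$?

# With "exit 0" as the last line, the status is forced to zero.
( failing_step; exit 0 )
rc_with=$?

echo "without exit 0: $rc_without"
echo "with exit 0:    $rc_with"
```

If the epilog only behaves once "exit 0" is added, the last command in the script is the likely culprit. I would treat "exit 0" as a test aid only, since it also hides real failures.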

Ahmet M.


On 18.11.2020 at 20:00, William Markuske wrote:

Hello,

I am having an odd problem where users are unable to kill their jobs with scancel. Users can submit jobs just fine, and when a task completes it shuts down cleanly. However, if a user attempts to cancel a job via scancel, the SIGKILL signals are sent to the step but never take effect. Slurmd then keeps sending SIGKILL until the UnkillableStepTimeout is hit, at which point the Slurm job exits with an error, the node enters a draining state, and the spawned processes continue to run on the node.

I'm at a loss because jobs can complete without issue, which seems to suggest it's not a networking or permissions issue preventing Slurm from doing its job accounting tasks. A user can ssh to the node once a job is submitted and kill the subprocesses manually, at which point Slurm completes the epilog and the node returns to idle.

Does anyone know what may be causing such behavior? Please let me know any slurm.conf or cgroup.conf settings that would be helpful to diagnose this issue. I'm quite stumped by this one.

--

Willy Markuske

HPC Systems Engineer

        

Research Data Services

P: (858) 246-5593

