Hi,
Check the epilog's return value, which is the exit status of the last
command in the epilog script. For testing purposes, you can also add an
"exit 0" line at the end of the epilog script to make sure it returns
zero.
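To illustrate, here is a small sketch you can run on any machine with
bash. It does not touch Slurm at all. The temporary file only stands in
for a real epilog script, and "false" stands in for a hypothetical
cleanup command that fails as the last line.

```shell
#!/bin/bash
# Sketch only: a throwaway file plays the role of the epilog script.
epilog=$(mktemp)

# An "epilog" whose last command fails.
cat > "$epilog" <<'EOF'
#!/bin/bash
false   # placeholder for a cleanup command that fails
EOF

# The script's return value is the exit status of its last command.
bash "$epilog"
rc_without=$?

# Append "exit 0" as the last line and run it again.
echo 'exit 0' >> "$epilog"
bash "$epilog"
rc_with=$?

echo "without exit 0: $rc_without, with exit 0: $rc_with"
rm -f "$epilog"
```

If the return value changes to zero only after adding "exit 0", then
the last command of the script is the one to look at. Keep in mind this
is meant for testing, since it also hides real failures.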
Ahmet M.
On 18.11.2020 at 20:00, William Markuske wrote:
Hello,
I am having an odd problem where users are unable to kill their jobs
with scancel. Users can submit jobs just fine, and when a task
completes on its own the job closes out correctly. However, if a user
attempts to cancel a job via scancel, the SIGKILL signals are sent to
the step but the processes do not exit. Slurmd then continues to send
SIGKILL requests until the UnkillableStepTimeout is hit, at which point
the Slurm job exits with an error, the node enters a draining state,
and the spawned processes continue to run on the node.
I'm at a loss because jobs can complete without issue, which seems to
suggest it's not a networking or permissions problem preventing Slurm
from doing its job accounting tasks. A user can ssh to the node once a
job is submitted and kill the subprocesses manually, at which point
Slurm completes the epilog and the node returns to idle.
Does anyone know what may be causing such behavior? Please let me know
which slurm.conf or cgroup.conf settings would be helpful for
diagnosing this issue. I'm quite stumped by this one.
--
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593