These log lines about the prolog script look very suspicious to me:
[2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]
then
[2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug: Finis
The epilog script does have exit 0 set at the end. Epilogs exit cleanly
when run.
With the log level set to debug5, I get the following results for any scancel call.
Submit host slurmctld.log
[2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191
[2020-11-18T10:19:35.
Hi,
Check the epilog's return value, which is the exit status of the last
command in the epilog script. You can also add an "exit 0" line at the
end of the epilog script to guarantee a zero return value for testing
purposes.
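
A minimal sketch of that check, assuming bash is available. The two
functions below are hypothetical stand-ins for an epilog script, not the
actual site epilog; they only illustrate that a script's exit status is
that of its last command unless an explicit "exit 0" overrides it.

```shell
#!/bin/bash
# Hypothetical stand-ins for an epilog script (assumption: the real
# epilog ends with a command that may fail).

# Variant 1: the last command fails, so the script exits non-zero.
epilog_without_exit() {
    bash -c 'echo "cleanup step"; false'
}

# Variant 2: same body, but an explicit "exit 0" forces a zero status.
epilog_with_exit() {
    bash -c 'echo "cleanup step"; false; exit 0'
}

epilog_without_exit >/dev/null
status_without=$?

epilog_with_exit >/dev/null
status_with=$?

echo "without exit 0: $status_without"
echo "with exit 0:    $status_with"
```

The same check can be made against the real script by running it by hand
and printing "$?" immediately afterwards.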
Ahmet M.
On 18.11.2020 at 20:00, William Markuske wrote:
Hello,
I am having an odd problem where users are unable to kill their jobs
with scancel. Users can submit jobs just fine and when the task
completes it is able to close correctly. However, if a user attempts to
cancel a job via scancel the SIGKILL signals are sent to the step but
don't compl