Hi All,

I have been reading the archives, hoping to implement UnkillableStepTimeout and UnkillableStepProgram in our Slurm setup, but I'm a bit confused about how they are applied.
1. I presume UnkillableStepTimeout is set in slurm.conf, and that it acts as a timer that triggers UnkillableStepProgram.
2. UnkillableStepProgram can be used to send an email or reboot the compute node. The question is: how do we configure it?

Our current settings:

scontrol show config | grep -i kill
KillOnBadExit = 1
KillWait = 30 sec
UnkillableStepProgram = (null)
UnkillableStepTimeout = 300 sec

Please advise.

Thanks,
Mike