On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:
As an example, my UnkillableStepProgram is just a bash script that collects
recent logs and processes and mails me about the error. Nothing special.
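A script along those lines might look roughly like the sketch below. It is
not the actual script described above, only an illustration under some
assumptions: REPORT_MAILTO and SLURMD_LOG are hypothetical knobs (not Slurm
settings), the log path is a guess that will differ per site, and
SLURM_JOB_ID / SLURM_STEP_ID are assumed to be set in the program's
environment by slurmd, with a fallback if they are not.

```shell
#!/bin/bash
# Sketch of an UnkillableStepProgram: gather some context, then mail it.
# REPORT_MAILTO and SLURMD_LOG are hypothetical knobs, not Slurm settings.
REPORT_MAILTO="${REPORT_MAILTO:-}"
SLURMD_LOG="${SLURMD_LOG:-/var/log/slurm/slurmd.log}"   # assumed path

build_report() {
    echo "Unkillable step on $(uname -n) at $(date)"
    # Assumed to be set by slurmd; fall back to "unknown" otherwise.
    echo "job ${SLURM_JOB_ID:-unknown}, step ${SLURM_STEP_ID:-unknown}"
    echo "--- processes in uninterruptible sleep (state D) ---"
    # Keep the header line plus every process whose STAT starts with D.
    ps -eo pid,user,stat,etime,args 2>/dev/null | awk 'NR == 1 || $3 ~ /^D/'
    echo "--- last 50 lines of ${SLURMD_LOG} ---"
    tail -n 50 "${SLURMD_LOG}" 2>/dev/null || echo "(log not readable)"
}

send_report() {
    # Mail the report only if a recipient is set and "mail" exists;
    # otherwise just print it to stdout.
    if [ -n "${REPORT_MAILTO}" ] && command -v mail >/dev/null 2>&1; then
        mail -s "Unkillable step on $(uname -n)" "${REPORT_MAILTO}"
    else
        cat
    fi
}

build_report | send_report
```

Processes stuck in state D are listed because they are the usual suspects
when a step cannot be killed, but the report cannot say why they are stuck.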
We use Slurm "triggers" to get alerts from many different types of events, see
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers
Relevant here is the "notify_nodes_drained" trigger script, which fires
when a node reaches the drained state.
We don't use an UnkillableStepProgram. In my experience the *Kill task
failed* events discussed earlier in this thread require a manual
examination of why the job failed to die, and I think it will be hard to
write a script to examine all kinds of possible errors.
The most common scenario is stale I/O from the job to a network file
server, and I described in a previous post how we deal with this.
BTW we use this parameter: UnkillableStepTimeout = 180 sec
On Thu, 18 Sep 2025 at 12:22, Gestió Servidors via slurm-users
<[email protected]> wrote:
After reading the answer from Ole Holm Nielsen, I have increased
“MessageTimeout” to 20s (the default is 5s) and “UnkillableStepTimeout”
to 150s (the default is 60s, and it should always be at least 5 times
larger than “MessageTimeout”). However, I have also read that
“UnkillableStepProgram” specifies the program to run in such cases,
but by default no program is assigned to that parameter (nothing is
run). So my question is whether anyone uses a customized
“UnkillableStepProgram” and, if so, whether they could explain it.
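The relation between the two timeouts is simple arithmetic and can be
sanity-checked with a few lines of shell. This is only a sketch:
check_timeouts is a hypothetical helper, not part of Slurm, and it takes the
two values in plain seconds (for example as read off the output of
"scontrol show config").

```shell
#!/bin/bash
# check_timeouts MESSAGE_TIMEOUT UNKILLABLE_STEP_TIMEOUT  (both in seconds)
# Hypothetical helper: succeeds when the second value is at least
# five times the first, and fails otherwise.
check_timeouts() {
    min=$(( $1 * 5 ))
    if [ "$2" -ge "$min" ]; then
        echo "ok: ${2}s >= ${min}s"
    else
        echo "too small: ${2}s < ${min}s"
        return 1
    fi
}

check_timeouts 20 150   # the values from the message above
# prints "ok: 150s >= 100s"
```

With MessageTimeout at 20s, anything from 100s upwards passes, so both 150s
and 180s leave some headroom.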
IHTH,
Ole
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]