[slurm-users] Re: Node in drain state

Ole Holm Nielsen via slurm-users Fri, 19 Sep 2025 00:02:40 -0700

On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:

As an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special.
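For readers who want a starting point, here is a minimal sketch of what such a script could look like. It is not the script described above. It assumes that slurmd exports SLURM_JOB_ID and SLURM_STEP_ID to the program (as I read the slurm.conf man page), that a "mail" command works on the compute node, and that ADMIN_EMAIL is a placeholder you would set yourself.

```bash
#!/bin/bash
# Hypothetical UnkillableStepProgram sketch, not taken from this thread.
# Assumptions: SLURM_JOB_ID / SLURM_STEP_ID are set by slurmd, and "mail" exists.

ADMIN_EMAIL="${ADMIN_EMAIL:-root}"   # placeholder recipient

build_report() {
    echo "Unkillable step on $(uname -n) at $(date)"
    echo "Job: ${SLURM_JOB_ID:-unknown}  Step: ${SLURM_STEP_ID:-unknown}"
    echo
    echo "== Processes in uninterruptible sleep (state D) =="
    # Keep the header line plus every process whose state starts with D.
    ps -eo pid,user,stat,wchan:32,etime,cmd | awk 'NR == 1 || $3 ~ /^D/'
    echo
    echo "== Last kernel messages =="
    dmesg 2>/dev/null | tail -n 50
}

send_report() {
    # Mail the report read from stdin; fall back to stdout if "mail" is missing.
    if command -v mail >/dev/null 2>&1; then
        mail -s "Unkillable step on $(uname -n)" "$ADMIN_EMAIL"
    else
        cat
    fi
}

# Run only when executed directly, so the functions can be sourced and tested.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    build_report | send_report
fi
```

Processes stuck in state D are usually the ones that ignore SIGKILL, so listing them together with their wait channel is often enough to see which filesystem or device they are blocked on.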


We use Slurm "triggers" to get alerts from many different types of events, see
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers

Relevant here is the "notify_nodes_drained" trigger script for the node drained state.
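As a rough sketch of how such a trigger might be registered with strigger: the script path below is a hypothetical install location, not something stated in this thread, and the exact options used by the scripts in the repository above may differ. As far as I know, triggers normally have to be set by the SlurmUser account and the program runs on the slurmctld host.

```bash
#!/bin/bash
# Sketch: register a permanent trigger that fires when any node becomes drained.
# TRIGGER_SCRIPT is a hypothetical path; point it at your own notification script.
TRIGGER_SCRIPT="${TRIGGER_SCRIPT:-/usr/local/sbin/notify_nodes_drained}"

# Build the strigger command line as an array so it can be printed or executed.
cmd=(strigger --set --node --drained --flags=PERM --program="$TRIGGER_SCRIPT")

if [[ "${DRY_RUN:-1}" == "1" ]]; then
    # Default: only show what would be run.
    printf '%s\n' "${cmd[*]}"
else
    "${cmd[@]}"
fi
```

Run it with DRY_RUN=0 to actually set the trigger, and use "strigger --get" afterwards to list what is registered. Without --flags=PERM the trigger is purged after it fires once and has to be set again.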

We don't use an UnkillableStepProgram. In my experience the *Kill task failed* events discussed earlier in this thread require a manual examination of why the job failed to die, and I think it will be hard to write a script to examine all kinds of possible errors.

The most common scenario is stale I/O from the job to a network fileserver, and I described in a previous post how we deal with this.


BTW we use this parameter: UnkillableStepTimeout = 180 sec
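The value above is in the format printed by "scontrol show config". A small sketch for checking the related settings on your own cluster follows; the function name filter_timeouts is only illustrative.

```bash
#!/bin/bash
# Sketch: show the timeout-related settings currently active in slurmctld.

filter_timeouts() {
    # Keep only the three parameters discussed in this thread.
    grep -E '^(MessageTimeout|UnkillableStepTimeout|UnkillableStepProgram)[[:space:]]*='
}

if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | filter_timeouts
else
    echo "scontrol not found; run this on a node with the Slurm client tools" >&2
fi
```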

On Thu, 18 Sep 2025 at 12:22, Gestió Servidors via slurm-users <[email protected]> wrote:


    After reading the answer from Ole Holm Nielsen, I have increased
    “MessageTimeout” to 20s (by default it is 5s) and “UnkillableStepTimeout”
    to 150s (by default it is 60s and always 5 times larger than
    “MessageTimeout”). However, I have also read that
    UnkillableStepProgram indicates the program to use in those cases,
    but by default there is no program assigned to that parameter (no
    program to run). So my question is whether someone uses a customized
    “UnkillableStepProgram” and whether he/she could explain it.


IHTH,
Ole

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
