On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:
    As an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special.
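
For readers who want a starting point: Lorenzo's bash script was not posted, so the following is only a minimal Python sketch of the same idea (collect a process snapshot and recent logs, then mail them). Everything in it is an assumption to adapt: the ADMIN_EMAIL address, the use of journalctl for the slurmd log, and /usr/sbin/sendmail as the local mailer. As far as I know slurmd passes SLURM_JOB_ID and SLURM_STEP_ID to the program in its environment, but please check the slurm.conf man page for your version.

```python
#!/usr/bin/env python3
"""Hypothetical UnkillableStepProgram: mail a snapshot of stuck processes."""
import os
import socket
import subprocess
from email.message import EmailMessage

ADMIN_EMAIL = "root"  # assumption: a local alias that reaches an admin


def run(cmd):
    """Run a command with a timeout; return its stdout, or a note on failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[{' '.join(cmd)} failed: {exc}]\n"


def stuck_processes(ps_output):
    """Keep ps lines whose STAT column (3rd field) starts with D, i.e.
    uninterruptible sleep, which is typical for stale network I/O."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) >= 3 and fields[2].startswith("D"):
            stuck.append(line.strip())
    return stuck


def build_report(job_id, step_id, ps_output, log_tail):
    """Assemble the plain-text mail body."""
    stuck = stuck_processes(ps_output) or ["(none found)"]
    return "\n".join([
        f"Unkillable step: job {job_id}, step {step_id}",
        "",
        "Processes in state D:",
        *stuck,
        "",
        "Recent slurmd log:",
        log_tail.rstrip(),
    ]) + "\n"


def main():
    host = socket.gethostname()
    job_id = os.environ.get("SLURM_JOB_ID", "unknown")
    step_id = os.environ.get("SLURM_STEP_ID", "unknown")
    ps_output = run(["ps", "-eo", "pid,user,stat,args"])
    # Assumption: slurmd runs as a systemd unit named "slurmd".
    log_tail = run(["journalctl", "-u", "slurmd", "-n", "50", "--no-pager"])

    msg = EmailMessage()
    msg["To"] = ADMIN_EMAIL
    msg["Subject"] = f"Unkillable step on {host}: job {job_id}"
    msg.set_content(build_report(job_id, step_id, ps_output, log_tail))
    # Hand the message to the local MTA; -t reads recipients from the headers.
    subprocess.run(["/usr/sbin/sendmail", "-t"], input=msg.as_bytes(), timeout=30)


# Only act when started with a job ID in the environment, so that running
# the file by hand does not send mail.
if __name__ == "__main__" and "SLURM_JOB_ID" in os.environ:
    main()
```

The script would be made executable on the compute nodes and its path set as UnkillableStepProgram in slurm.conf. The timeouts on the external commands are there because ps itself can hang on a node with stale file-server mounts.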

We use Slurm "triggers" to get alerts from many different types of events, see
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers

Relevant here is the "notify_nodes_drained" trigger script for the node drained state.

We don't use an UnkillableStepProgram. In my experience the *Kill task failed* events discussed earlier in this thread require manual examination of why the job failed to die, and I think it would be hard to write a script that examines all possible kinds of errors.

The most common scenario is stale I/O from the job to a network file server, and I described in a previous post how we deal with this.

BTW we use this parameter: UnkillableStepTimeout = 180 sec
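
If you want to check your own settings, here is a small Python sketch that reads text in the "Name = N sec" layout shown above (the layout printed by "scontrol show config") and tests the factor-of-5 relation between the two timeouts that is mentioned elsewhere in this thread. The function names and the sample text are made up for illustration; the sample values 20 and 150 are the ones quoted in the thread.

```python
import re


def parse_timeouts(config_text):
    """Extract MessageTimeout and UnkillableStepTimeout (in seconds)
    from lines of the form 'Name = N sec'."""
    wanted = ("MessageTimeout", "UnkillableStepTimeout")
    found = {}
    for line in config_text.splitlines():
        m = re.match(r"\s*(\w+)\s*=\s*(\d+)\s*sec\b", line)
        if m and m.group(1) in wanted:
            found[m.group(1)] = int(m.group(2))
    return found


def timeouts_consistent(found, factor=5):
    """True if UnkillableStepTimeout is at least factor * MessageTimeout."""
    return found["UnkillableStepTimeout"] >= factor * found["MessageTimeout"]


# Illustrative input only, using the values quoted in this thread.
sample = """\
MessageTimeout          = 20 sec
UnkillableStepTimeout   = 150 sec
"""
found = parse_timeouts(sample)
print(found, timeouts_consistent(found))
```

On a real cluster you could feed it the output of "scontrol show config" instead of the sample string.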

On Thu, 18 Sep 2025 at 12:22, Gestió Servidors via slurm-users <[email protected]> wrote:

    After reading the answer from Ole Holm Nielsen, I have increased
    “MessageTimeout” to 20s (the default is 5s) and “UnkillableStepTimeout”
    to 150s (the default is 60s, and always 5 times larger than
    “MessageTimeout”). However, I have also read that
    “UnkillableStepProgram” specifies the program to run in such cases,
    but by default no program is assigned to that parameter (no
    program is run). So my question is whether anyone uses a customized
    “UnkillableStepProgram” and, if so, whether they could explain it.

IHTH,
Ole

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
