On Tuesday, 03 May 2022, at 15:46:38 (+0800),
taleinterve...@sjtu.edu.cn wrote:

We need to detect certain problems at job-end time, so we run detection
scripts in the Slurm epilog, which should drain the node if a check
fails.

I know that exiting the epilog with a non-zero code makes Slurm drain
the node automatically, but then the drain reason is always marked as
"Epilog error", and our auto-repair program cannot determine from that
how to repair the node.

Another way is to call scontrol directly from the epilog to drain the
node, but the official documentation (https://slurm.schedmd.com/prolog_epilog.html) says:

Prolog and Epilog scripts should be designed to be as short as possible and
should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc.).
Slurm commands in these scripts can potentially lead to performance issues
and should not be used.

So what is the best way to drain a node from the epilog with a
self-defined reason, or to tell Slurm to attach a more verbose message
than the "Epilog error" reason?

Invoking `scontrol` from a prolog/epilog script to simply alter nodes'
state and/or reason fields is totally fine.  Many sites (including
ours) use LBNL NHC for all or part of their epilogs' post-job "sanity
checking" of nodes, and -- knock on renewable bamboo -- there have
been no concurrency issues (loops, deadlocks, etc.) reported to either
project to date. :-)
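For concreteness, here's a minimal sketch of what that looks like in an
epilog.  The `drain_self` helper and the "epilog:" reason prefix are
conventions I'm making up for illustration; the `scontrol update
NodeName=... State=DRAIN Reason=...` syntax is the documented one, and
SLURMD_NODENAME is among the environment variables slurmd sets for
prolog/epilog scripts:

```shell
#!/bin/bash
# Hypothetical epilog helper -- the function name and reason-string
# convention are placeholders, not anything Slurm defines.

# Drain this node with a machine-readable reason that an auto-repair
# tool can match on, instead of the generic "Epilog error".
# SLURMD_NODENAME is set by slurmd in the prolog/epilog environment;
# fall back to the short hostname for manual testing.
drain_self() {
    scontrol update NodeName="${SLURMD_NODENAME:-$(hostname -s)}" \
        State=DRAIN Reason="epilog: $1"
}
```

In the epilog you'd then call, say, `drain_self "fs_mount_missing"` when
a check fails, and still `exit 0` at the end, so that slurmd does not
*also* drain the node with the generic "Epilog error" reason on top of
the one you just set.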

If it helps, I had similar concerns about invoking the `squeue`
command from an NHC run in order to gather job data.  The Man Himself
(Moe Jette, original creator of Slurm and co-founder of SchedMD) was
kind enough to weigh in on the issue (literally, the Issue:
https://github.com/mej/nhc/issues/15), saying in part,

    "I do not believe that you could create a deadlock situation from
     NHC (if you did, I would consider that a Slurm bug)."
               -- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script
with all the `s`-commands you can think of.... ;-)  But you can at
least be reasonably confident that draining/offlining a node from an
epilog script will not cause your cluster to implode!
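And once the reasons are self-defined, the auto-repair side can dispatch
on them.  A sketch, assuming your epilog tags its reasons with a
site-chosen prefix such as "epilog:" (again a convention, not anything
Slurm defines); `%N` and `%E` are the standard sinfo format specifiers
for node name and reason:

```shell
#!/bin/bash
# List drained nodes whose reason was set by the epilog, and emit one
# "repair" line per node for a (hypothetical) repair daemon to consume.
list_epilog_drains() {
    # -h: no header, -N: one line per node, -t drain: drained nodes only,
    # -o '%N %E': print node name followed by the reason field.
    sinfo -h -N -t drain -o '%N %E' |
    while read -r node reason; do
        case "$reason" in
            # Only act on reasons our own epilog set; skip "Epilog error"
            # and anything an admin set by hand.
            epilog:*) printf 'repair %s: %s\n' "$node" "${reason#epilog: }" ;;
        esac
    done
}
```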

Michael

--
Michael E. Jennings <m...@lanl.gov> - [PGPH: he/him/his/Mr]  --  hpc.lanl.gov
HPC Systems Engineer   --   Platforms Team   --  HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341    W: +1 (505) 606-0605
Los Alamos National Laboratory,  P.O. Box 1663,  Los Alamos, NM   87545-0001
