On Tuesday, 03 May 2022, at 15:46:38 (+0800), taleinterve...@sjtu.edu.cn wrote:
We need to detect certain problems when a job ends, so we wrote a detection script in the Slurm epilog that should drain the node if a check fails. I know that exiting the epilog with a non-zero code makes Slurm drain the node automatically, but then the drain reason is always recorded as "Epilog error", and our auto-repair program has trouble determining how to repair the node.

Another way is to call scontrol directly from the epilog to drain the node, but the official documentation at https://slurm.schedmd.com/prolog_epilog.html says: "Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Slurm commands in these scripts can potentially lead to performance issues and should not be used."

So what is the best way to drain a node from the epilog with a self-defined reason, or to tell Slurm to add a more verbose message alongside the "Epilog error" reason?
Invoking `scontrol` from a prolog/epilog script to simply alter nodes' state and/or reason fields is totally fine. Many sites (including ours) use LBNL NHC for all or part of their epilogs' post-job "sanity checking" of nodes, and -- knock on renewable bamboo -- there have been no concurrency issues (loops, deadlocks, etc.) reported to either project to date. :-)

If it helps, I had similar concerns about invoking the `squeue` command from an NHC run in order to gather job data. The Man Himself (Moe Jette, original creator of Slurm and co-founder of SchedMD) was kind enough to weigh in on the issue (literally, the Issue: https://github.com/mej/nhc/issues/15), saying in part:

"I do not believe that you could create a deadlock situation from NHC (if you did, I would consider that a Slurm bug)."
 -- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script with all the `s`-commands you can think of.... ;-) But you can at least be reasonably confident that draining/offlining a node from an epilog script will not cause your cluster to implode!

Michael

--
Michael E. Jennings <m...@lanl.gov> - [PGPH: he/him/his/Mr] -- hpc.lanl.gov
HPC Systems Engineer -- Platforms Team -- HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001
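
For anyone wanting a starting point, here is a minimal sketch of the approach discussed above: draining the local node from an epilog with a site-specific reason via a single `scontrol update` call. It is not taken from NHC or from the original poster's script. The function names (`build_drain_command`, `current_node`, `drain_node`), the 30-second timeout, and the example reason string are assumptions made for illustration. It also assumes that `SLURMD_NODENAME` is set in the epilog environment and that `scontrol` is on the `PATH` of the user running the epilog.

```python
import os
import socket
import subprocess


def build_drain_command(node, reason):
    """Return the argv list for `scontrol update` that drains `node`.

    Passing an argv list (no shell) keeps a reason containing spaces
    intact as a single argument.
    """
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def current_node():
    """Best-effort node name: SLURMD_NODENAME if set, else the short hostname."""
    return os.environ.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]


def drain_node(reason, node=None, runner=subprocess.run):
    """Drain `node` (default: this node) with `reason`.

    Returns True on success and False if scontrol is missing, fails,
    or exceeds the (assumed) 30-second timeout. `runner` is injectable
    so the logic can be exercised without a Slurm installation.
    """
    cmd = build_drain_command(node or current_node(), reason)
    try:
        runner(cmd, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return True
```

An epilog built on this could run its checks, call something like `drain_node("epilog: GPU check failed")` on failure, and exit 0 when the drain succeeds so that the custom reason is what the auto-repair tooling sees. If `drain_node` returns False, exiting non-zero falls back to Slurm's own drain with the generic "Epilog error" reason. In keeping with the advice above, the sketch makes only one Slurm call per failed check.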