I've done something similar by having the epilog touch a file, then having the node health check (LBNL NHC) act on that file's presence/contents later to do the heavy lifting. There's a delay during which the drain reason reads "Epilog error" before the health check corrects it, but if that's tolerable this makes for a fast epilog script.
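
A minimal sketch of that hand-off, in case it helps. The flag file path and both function names are made up for illustration, and the check only prints the reason; how the message ends up as the node's drain reason depends on how your health check is set up (in an NHC check you would report the failure through NHC's own mechanism rather than echo).

```shell
# Hypothetical flag file; pick any node-local path both scripts can reach.
FLAG_FILE="${FLAG_FILE:-/var/run/epilog_drain_reason}"

# Epilog side: record why the node should be drained. No Slurm commands
# are called here, so the epilog stays fast. The epilog may still exit
# non-zero afterwards so the node drains right away with "Epilog error".
record_drain_reason() {
    printf '%s\n' "$1" > "$FLAG_FILE"
}

# Health check side: fail if the epilog left a reason behind.
# Returns 0 when there is no flag file (or it is empty), 1 otherwise.
check_epilog_flag() {
    [ -s "$FLAG_FILE" ] || return 0
    local reason
    reason=$(head -n 1 "$FLAG_FILE")
    # Assumption: the caller turns this message into the drain reason.
    echo "epilog flagged node: $reason"
    return 1
}
```

The epilog would call something like `record_drain_reason "GPU check failed"` when its detection script fails. The repair tooling should remove the flag file once the node is fixed, otherwise the check keeps failing.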

griznog

On Tue, May 3, 2022 at 2:49 AM <taleinterve...@sjtu.edu.cn> wrote:

> Hi, all:
>
> We need to detect some problem at job end timepoint, so we write some
> detection script in slurm epilog, which should drain the node if check is
> not passed.
>
> I know exit epilog with non-zero code will make slurm automatically drain
> the node. But in such way, drain reason will all be marked as *“Epilog
> error”*. Then our auto-repair program will have trouble to determine how
> to repair the node.
>
> Another way is call *scontrol* directly from epilog to drain the node,
> but from official doc https://slurm.schedmd.com/prolog_epilog.html it
> wrote:
>
> *Prolog and Epilog scripts should be designed to be as short as possible
> and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc).
> … Slurm commands in these scripts can potentially lead to performance
> issues and should not be used.*
>
> So what is the best way to drain node from epilog with a self-defined
> reason, or tell slurm to add more verbose message besides *“Epilog error”*
> reason?