We've invoked scontrol in our epilog script for years to close off nodes
without any issue. What the docs are really warning against is gratuitous
use of those commands. If your calls are well circumscribed (i.e. only
invoked when you actually have to close a node) and you only use them
when you have no other workaround, then you should be fine.
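
As a rough illustration of what "well circumscribed" could look like, here is a minimal sketch, not our actual epilog. It assumes the node check lives in a separate site script named by a hypothetical NODE_CHECK_CMD variable, which prints a short reason and returns non-zero on failure. The SCONTROL override is likewise only an assumption added so the script can be dry-run. The scontrol call itself is the standard "update NodeName=... State=DRAIN Reason=..." form, and SLURMD_NODENAME is the node name Slurm exports to epilog scripts.

```shell
#!/bin/bash
# Sketch of an epilog fragment: drain this node with a specific reason
# instead of exiting non-zero (which Slurm records as "Epilog error").

# Assumption: allow overriding the command for dry runs, e.g. SCONTROL=echo.
SCONTROL="${SCONTROL:-scontrol}"

# drain_node NODE REASON: the only place a Slurm command is invoked.
drain_node() {
    local node="$1" reason="$2"
    "$SCONTROL" update NodeName="$node" State=DRAIN Reason="$reason"
}

# run_check: HYPOTHETICAL hook for the site's detection script.
# NODE_CHECK_CMD is an assumed variable, not a Slurm setting. The script
# it names should print a short reason and return non-zero on failure.
run_check() {
    [ -n "${NODE_CHECK_CMD:-}" ] || return 0   # nothing configured: pass
    "$NODE_CHECK_CMD"
}

# Call scontrol only when the check fails, so a healthy job end
# issues no Slurm commands at all.
if ! reason="$(run_check)"; then
    drain_node "${SLURMD_NODENAME:-$(hostname -s)}" "epilog check: ${reason:-unknown}"
fi
```

Setting SCONTROL=echo prints the command that would be run, which is a cheap way to test the script outside of Slurm. The point of the layout is that the Slurm command sits behind the failure branch, so the common case never touches the controller.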
-Paul Edmon-
On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We need to detect certain problems at the end of each job, so we wrote a
detection script in the Slurm epilog, which should drain the node if the
check fails.
I know that exiting the epilog with a non-zero code makes Slurm drain the
node automatically. But that way, the drain reason is always marked as
*“Epilog error”*, and our auto-repair program then has trouble
determining how to repair the node.
Another way is to call *scontrol* directly from the epilog to drain the
node, but the official doc https://slurm.schedmd.com/prolog_epilog.html
says:
/Prolog and Epilog scripts should be designed to be as short as
possible and should not call Slurm commands (e.g. squeue, scontrol,
sacctmgr, etc). … Slurm commands in these scripts can potentially lead
to performance issues and should not be used./
So what is the best way to drain a node from the epilog with a
self-defined reason, or to tell Slurm to add a more verbose message than
the *“Epilog error”* reason?