We've invoked scontrol in our epilog script for years to close off nodes without any issue.  What the docs are really warning against is gratuitous use of those commands.  If your use of them is well circumscribed (i.e. invoked only when you actually have to close a node) and you use them only when you have no other workaround, then you should be fine.
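
For illustration only, here is a minimal sketch of what a narrowly scoped drain call in an epilog might look like. It is not the script mentioned above. The `SCONTROL` override, the `drain_node` and `check_node` helpers, and the `/tmp` check are hypothetical names made up for the example. It assumes slurmd exports `SLURMD_NODENAME` to the epilog and that the epilog runs with enough privilege to update node state.

```shell
#!/bin/bash
# Sketch of an epilog fragment. Helper names and the check are hypothetical.

# Hypothetical knob: lets a dry run substitute another command for scontrol.
SCONTROL="${SCONTROL:-scontrol}"

# Drain this node with a specific reason so a repair tool can tell failures apart.
drain_node() {
    local reason="$1"
    # Assumes slurmd sets SLURMD_NODENAME; otherwise fall back to the short hostname.
    local node="${SLURMD_NODENAME:-$(hostname -s)}"
    "$SCONTROL" update NodeName="$node" State=DRAIN Reason="$reason"
}

# Hypothetical health check: replace with the real detection logic.
check_node() {
    [ -d /tmp ]
}

# scontrol is invoked only on the failure path, never on a healthy node.
if ! check_node; then
    drain_node "epilog check: /tmp missing"
fi
```

On a healthy node this runs no Slurm command at all, which keeps the call as circumscribed as described above. If the scontrol call itself fails, the fragment ends with a non-zero status, so the generic "Epilog error" drain would still apply as a fallback.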

-Paul Edmon-

On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote:

Hi, all:

We need to detect certain problems when a job ends, so we wrote a detection script in the Slurm epilog, which should drain the node if the check does not pass.

I know that exiting the epilog with a non-zero code will make Slurm drain the node automatically. But in that case the drain reason is always marked as *“Epilog error”*, so our auto-repair program has trouble determining how to repair the node.

Another way is to call *scontrol* directly from the epilog to drain the node, but the official doc https://slurm.schedmd.com/prolog_epilog.html says:

/Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). … Slurm commands in these scripts can potentially lead to performance issues and should not be used./

So what is the best way to drain a node from the epilog with a self-defined reason, or to tell Slurm to add a more verbose message alongside the *“Epilog error”* reason?
