Hi all,
We need to detect certain problems at job-end time, so we wrote a
detection script into the Slurm epilog, which should drain the node if the
check does not pass.
I know that exiting the epilog with a non-zero code will make Slurm
automatically drain the node. But done that way, the drain reason will all be mar
We've invoked scontrol in our epilog scripts for years to close off nodes
without any issue. What the docs are really warning against is gratuitous
use of those commands. If you keep those commands well circumscribed
(i.e. only invoked when you actually have to close a node) and only use
them wh
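For reference, a minimal sketch of such a circumscribed epilog might look like the following. The check function, node name, and reason text are placeholders for whatever your site uses; it assumes scontrol is on the epilog's PATH.

```shell
#!/bin/bash
# Epilog sketch: drain the node with an explicit reason only when the
# post-job check actually fails. run_post_job_check is a hypothetical
# site-specific check script.
if ! run_post_job_check; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="post-job check failed (job $SLURM_JOB_ID)"
fi
# Exit 0 either way, so slurmd does not also drain the node itself and
# overwrite the reason with the generic epilog-failure message.
exit 0
```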
I've done something similar by having the epilog touch a file, then having
the node health check (LBNL NHC) act on that file's presence/contents later
to do the heavy lifting. There's a window of time/delay where the reason is
"Epilog error" before the health check corrects it, but if that's tolerable
this mak
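The touch-file handoff itself is just plain shell; a sketch of the two halves is below. The flag path and message are assumptions (a real setup would use something like /var/run and an NHC custom check rather than an inline if).

```shell
#!/bin/sh
# Epilog side: only record why the check failed, nothing else.
FLAG=/tmp/epilog_fail_demo
echo "GPU check failed" > "$FLAG"

# Health-check side, run later: act on the file if it exists, then
# clear it so the node is not flagged again on the next pass.
if [ -f "$FLAG" ]; then
    reason=$(cat "$FLAG")
    echo "would drain node: $reason"
    rm -f "$FLAG"
fi
```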
Hi Jim,
I don't know if it makes a difference, but I only ever use the complete
numeric suffix within brackets, as in
sjc01enadsapp[01-08]
Otherwise, I'd raise the slurmd debug level to the maximum by setting
SlurmdDebug=debug5
in /slurm.conf/, tail /SlurmdLogFile/ on a GPU node and then rest
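Concretely, that amounts to something like the following (the log path is whatever your SlurmdLogFile is actually set to; /var/log/slurmd.log is just a common default):

```shell
# In slurm.conf, then restart or reconfigure slurmd on the node:
#   SlurmdDebug=debug5
# Follow the slurmd log on the GPU node while reproducing the problem:
tail -f /var/log/slurmd.log
```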
I have found that the "reason" field doesn't get updated after you correct
the issue. For me, it's only when I move the node back to the idle state
that the reason field is reset. So, assuming /dev/nvidia[0-3] is
correct (I've never seen otherwise with NVIDIA GPUs), then try taking them
back
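Returning the nodes is a one-liner with scontrol; the node list below is taken from earlier in the thread, so adjust it for your cluster:

```shell
# RESUME moves drained nodes back into service; once they go idle,
# the stale reason field is cleared.
scontrol update NodeName=sjc01enadsapp[01-08] State=RESUME
```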
On Tuesday, 03 May 2022, at 15:46:38 (+0800),
taleinterve...@sjtu.edu.cn wrote:
We need to detect certain problems at job-end time, so we wrote a
detection script into the Slurm epilog, which should drain the node if the
check does not pass.
I know that exiting the epilog with a non-zero code will make Slurm autom