[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread taleintervenor
Hi, all: We need to detect certain problems when a job ends, so we wrote a detection script in the Slurm epilog that should drain the node if the check does not pass. I know that exiting the epilog with a non-zero code will make Slurm drain the node automatically. But in such way, drain reason will all be mar

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Paul Edmon
We've invoked scontrol in our epilog script for years to close off nodes without any issues. What the docs are really referring to is gratuitous use of those commands. If you have those commands well circumscribed (i.e. only invoked when you actually have to close a node) and only use them wh
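
A minimal sketch of the approach described above: an epilog that calls `scontrol update` only when a check fails, so the drain reason is the site's own text. Several parts are assumptions, not from the thread: `check_node` is a hypothetical placeholder for the real detection logic, the reason text is up to the site, and the fallback to the short hostname assumes it matches the Slurm NodeName. `SLURMD_NODENAME` is the variable Slurm sets in the epilog environment.

```python
#!/usr/bin/env python3
import os
import socket
import subprocess


def build_drain_command(node, reason):
    """Return the scontrol argv that drains `node` with a custom reason."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def check_node():
    """Hypothetical health check: None when healthy, else a reason string."""
    # Placeholder only; the real detection logic would go here.
    return None


def main():
    # Prefer the node name Slurm provides; fall back to the short hostname
    # (assumption: the short hostname matches the Slurm NodeName).
    node = os.environ.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]
    reason = check_node()
    if reason is not None:
        # scontrol is invoked only on failure, never on the healthy path.
        subprocess.run(build_drain_command(node, reason), check=False)
    # Return 0 so Slurm does not also drain the node with its own reason.
    return 0


if __name__ == "__main__":
    main()
```

Keeping the `scontrol` call behind the failure branch matches the advice above: the command runs only when a node really has to be closed.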

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread John Hanks
I've done something similar by having the epilog touch a file, then having the node health check (LBNL NHC) act on that file's presence/contents later to do the heavy lifting. There's a window of time/delay where the reason is "Epilog error" before the health check corrects it, but if that's tolerable this mak
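
The flag-file handoff described above might look roughly like the sketch below. It covers only the file exchange: the epilog records a reason, and a later health check reads it back. The path `FLAG_PATH` and both function names are hypothetical, and the reader side is written in Python purely for illustration; how NHC itself would consume such a file is not shown in the thread.

```python
import os
import tempfile

# Hypothetical location; any path both the epilog and the health check
# can reach would do.
FLAG_PATH = "/var/run/epilog_drain_reason"


def write_flag(reason, path=FLAG_PATH):
    """Epilog side: atomically record why the node should be drained."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(reason.strip() + "\n")
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp, path)


def read_flag(path=FLAG_PATH):
    """Health-check side: return the recorded reason, or None if no flag exists."""
    try:
        with open(path) as handle:
            return handle.read().strip() or None
    except FileNotFoundError:
        return None
```

Writing to a temporary file and renaming it avoids the health check picking up a partially written reason if it happens to run while the epilog is still writing.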

Re: [slurm-users] FW: gres/gpu count lower than reported

2022-05-03 Thread Stephan Roth
Hi Jim, I don't know if it makes a difference, but I only ever use the complete numeric suffix within brackets, as in sjc01enadsapp[01-08]. Otherwise I'd raise the debug level of slurmd to maximum by setting SlurmdDebug=debug5 in /slurm.conf/, tail /SlurmdLogFile/ on a GPU node and then rest

Re: [slurm-users] gres/gpu count lower than reported

2022-05-03 Thread David Henkemeyer
I have found that the "reason" field doesn't get updated after you correct the issue. For me, it's only when I move the node back to the idle state that the reason field is reset. So, assuming /dev/nvidia[0-3] is correct (I've never seen otherwise with NVIDIA GPUs), then try taking them back

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Michael Jennings
On Tuesday, 03 May 2022, at 15:46:38 (+0800), taleinterve...@sjtu.edu.cn wrote: We need to detect certain problems when a job ends, so we wrote a detection script in the Slurm epilog that should drain the node if the check does not pass. I know that exiting the epilog with a non-zero code will make Slurm autom