[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread taleintervenor
Hi, all:

 

We need to detect certain problems at job-end time, so we wrote a
detection script in the Slurm epilog that should drain the node if the
check does not pass.

I know that exiting the epilog with a non-zero code will make Slurm drain
the node automatically. But in that case the drain reason is always marked
as "Epilog error", and our auto-repair program then has trouble determining
how to repair the node.

Another way is to call scontrol directly from the epilog to drain the node,
but the official documentation at https://slurm.schedmd.com/prolog_epilog.html says:

Prolog and Epilog scripts should be designed to be as short as possible and
should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). …
Slurm commands in these scripts can potentially lead to performance issues
and should not be used.

So what is the best way to drain a node from the epilog with a self-defined
reason, or to tell Slurm to attach a more verbose message than the generic
"Epilog error" reason?



Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Paul Edmon
We've invoked scontrol in our epilog script for years to close off nodes
without any issue.  What the docs are really referring to is gratuitous
use of those commands.  If you keep those commands well circumscribed
(i.e. only invoked when you actually have to close a node) and use them
only when you have no other workaround, you should be fine.
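
A minimal sketch of what that can look like in an epilog (the check script,
the reason string, and the assumption that the short hostname matches the
Slurm NodeName are all placeholders to adapt to your own setup):

    #!/bin/bash
    # Epilog fragment: drain this node with a self-defined reason only
    # when the local post-job check fails.
    if ! /usr/local/sbin/post_job_check.sh; then    # hypothetical check script
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="post-job check failed, see node logs"
    fi
    exit 0    # exit 0 so slurmd does not additionally mark "Epilog error"

Your auto-repair tooling can then match on the Reason string instead of the
generic "Epilog error".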


-Paul Edmon-

On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote:


Hi, all:

We need to detect certain problems at job-end time, so we wrote a
detection script in the Slurm epilog that should drain the node if the
check does not pass.

I know that exiting the epilog with a non-zero code will make Slurm drain
the node automatically. But in that case the drain reason is always marked
as "Epilog error", and our auto-repair program then has trouble
determining how to repair the node.

Another way is to call scontrol directly from the epilog to drain the
node, but the official documentation at
https://slurm.schedmd.com/prolog_epilog.html says:

Prolog and Epilog scripts should be designed to be as short as possible
and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr,
etc). … Slurm commands in these scripts can potentially lead to
performance issues and should not be used.

So what is the best way to drain a node from the epilog with a
self-defined reason, or to tell Slurm to attach a more verbose message
than the generic "Epilog error" reason?


Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread John Hanks
I've done something similar by having the epilog touch a file, then having
the node health check (LBNL NHC) act on that file's presence/contents later
to do the heavy lifting. There's a window of time where the reason is
"Epilog error" before the health check corrects it, but if that delay is
tolerable, this makes for a fast epilog script.
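
Roughly, the pattern looks like this (the marker file location and check
script are placeholders; the second half can live in an NHC custom check,
a cron job, or whatever runs your health checks):

    # epilog side: record the problem, stay fast
    if ! /usr/local/sbin/post_job_check.sh; then    # hypothetical check script
        echo "post-job check failed: scratch fs errors" > /var/run/slurm_drain_reason
        exit 1   # node drains right away, reason temporarily "Epilog error"
    fi
    exit 0

    # health-check side, run later: replace the reason with the recorded one
    if [ -s /var/run/slurm_drain_reason ]; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="$(cat /var/run/slurm_drain_reason)"
        rm -f /var/run/slurm_drain_reason
    fi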

griznog

On Tue, May 3, 2022 at 2:49 AM  wrote:

> Hi, all:
>
> We need to detect certain problems at job-end time, so we wrote a
> detection script in the Slurm epilog that should drain the node if the
> check does not pass.
>
> I know that exiting the epilog with a non-zero code will make Slurm drain
> the node automatically. But in that case the drain reason is always
> marked as "Epilog error", and our auto-repair program then has trouble
> determining how to repair the node.
>
> Another way is to call scontrol directly from the epilog to drain the
> node, but the official documentation at
> https://slurm.schedmd.com/prolog_epilog.html says:
>
> Prolog and Epilog scripts should be designed to be as short as possible
> and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr,
> etc). … Slurm commands in these scripts can potentially lead to
> performance issues and should not be used.
>
> So what is the best way to drain a node from the epilog with a
> self-defined reason, or to tell Slurm to attach a more verbose message
> than the generic "Epilog error" reason?
>


Re: [slurm-users] FW: gres/gpu count lower than reported

2022-05-03 Thread Stephan Roth

Hi Jim,

I don't know if it makes a difference, but I only ever use the complete 
numeric suffix within brackets, as in


sjc01enadsapp[01-08]

Otherwise I'd raise the debug level of slurmd to maximum by setting

SlurmdDebug=debug5

in slurm.conf, tail the SlurmdLogFile on a GPU node and then restart
slurmd there.

This might shed some light on what goes wrong.
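
Something along these lines, as a rough sketch (the paths are assumptions;
use wherever your slurm.conf and SlurmdLogFile actually live):

    # in slurm.conf on the GPU node (distribute it however you normally do):
    #   SlurmdDebug=debug5
    tail -f /var/log/slurm/slurmd.log &    # the file SlurmdLogFile points at
    sudo systemctl restart slurmd
    # the startup messages should show what slurmd discovers (or fails to
    # discover) for gres/gpu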

Cheers,
Stephan

On 03.05.22 20:51, Jim Kavitsky wrote:


Whoops. Sent the first to an incorrect address… apologies if this shows
up as a duplicate.


-jimk

From: Jim Kavitsky
Date: Tuesday, May 3, 2022 at 11:46 AM
To: slurm-us...@schedmd.com
Subject: gres/gpu count lower than reported

Hello Fellow Slurm Admins,

I have a new Slurm installation that was working and running basic test
jobs until I added GPU support. My worker nodes are now all in the drain
state, with "gres/gpu count reported lower than configured (0 < 4)".

This is in spite of the fact that nvidia-smi reports all four A100s as
active on each node. I have spent a good chunk of a week googling around
for a solution, and trying variants of the gpu config lines / restarting
daemons, without any luck.


The relevant lines from my current config files are below. The head 
node and all workers have the same gres.conf and slurm.conf files. Can 
anyone suggest anything else I should be looking at or adding? I’m 
guessing that this is a problem that many have faced, and any guidance 
would be greatly appreciated.


root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf

GresTypes=gpu

NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN

root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf

NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]

root@sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"

NODELIST         CPUS(A/I/O/T)  STATE  MEMORY   PARTITION  GRES   REASON
sjc01enadsapp01  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp02  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp03  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp04  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp05  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp06  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp07  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp08  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)


root@sjc01enadsapp07:~# nvidia-smi

Tue May  3 18:41:34 2022

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | :17:00.0         Off |                    0 |
| N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | :65:00.0         Off |                    0 |
| N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | :CA:00.0         Off |                    0 |
| N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | :E3:00.0         Off |                    0 |
| N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2179

Re: [slurm-users] gres/gpu count lower than reported

2022-05-03 Thread David Henkemeyer
I have found that the "reason" field doesn't get updated after you correct
the issue.  For me, it's only when I move the node back to the idle state
that the reason field is reset.  So, assuming /dev/nvidia[0-3] is correct
(I've never seen otherwise with NVIDIA GPUs), try moving the nodes back to
the idle state.  Also, keep an eye on the slurmctld and slurmd logs; they
are usually quite helpful in highlighting what the issue is.
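
For example (node list taken from your sinfo output):

    scontrol update NodeName=sjc01enadsapp0[1-8] State=RESUME

If the gres mismatch is still present they will simply drop back into
drain, but at least the reason field will then reflect the current state.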

David

On Tue, May 3, 2022 at 11:50 AM Jim Kavitsky wrote:

> Hello Fellow Slurm Admins,
>
>
>
> I have a new Slurm installation that was working and running basic test
> jobs until I added gpu support. My worker nodes are now all in drain state,
> with gres/gpu count reported lower than configured (0 < 4)
>
>
>
> This is in spite of the fact that nvidia-smi reports all four A100’s as
> active on each node. I have spent a good chunk of a week googling around
> for the solution to this, and trying variants of the gpu config
> lines/restarting daemons without any luck.
>
>
>
> The relevant lines from my current config files are below. The head node
> and all workers have the same gres.conf and slurm.conf files. Can anyone
> suggest anything else I should be looking at or adding? I’m guessing that
> this is a problem that many have faced, and any guidance would be greatly
> appreciated.
>
>
>
> root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
>
> GresTypes=gpu
>
> NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2
> CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
>
> root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
>
> NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]
>

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Michael Jennings

On Tuesday, 03 May 2022, at 15:46:38 (+0800),
taleinterve...@sjtu.edu.cn wrote:


We need to detect certain problems at job-end time, so we wrote a
detection script in the Slurm epilog that should drain the node if the
check does not pass.

I know that exiting the epilog with a non-zero code will make Slurm drain
the node automatically. But in that case the drain reason is always marked
as "Epilog error", and our auto-repair program then has trouble
determining how to repair the node.

Another way is to call scontrol directly from the epilog to drain the
node, but the official documentation at
https://slurm.schedmd.com/prolog_epilog.html says:

Prolog and Epilog scripts should be designed to be as short as possible
and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr,
etc). … Slurm commands in these scripts can potentially lead to
performance issues and should not be used.

So what is the best way to drain a node from the epilog with a
self-defined reason, or to tell Slurm to attach a more verbose message
than the generic "Epilog error" reason?


Invoking `scontrol` from a prolog/epilog script to simply alter nodes'
state and/or reason fields is totally fine.  Many sites (including
ours) use LBNL NHC for all or part of their epilogs' post-job "sanity
checking" of nodes, and -- knock on renewable bamboo -- there have
been no concurrency issues (loops, deadlocks, etc.) reported to either
project to date. :-)

If it helps, I had similar concerns about invoking the `squeue`
command from an NHC run in order to gather job data.  The Man Himself
(Moe Jette, original creator of Slurm and co-founder of SchedMD) was
kind enough to weigh in on the issue (literally, the Issue:
https://github.com/mej/nhc/issues/15), saying in part,

"I do not believe that you could create a deadlock situation from
 NHC (if you did, I would consider that a Slurm bug)."
   -- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script
with all the `s`-commands you can think of ;-)  But you can at
least be reasonably confident that draining/offlining a node from an
epilog script will not cause your cluster to implode!

Michael

--
Michael E. Jennings  - [PGPH: he/him/his/Mr]  --  hpc.lanl.gov
HPC Systems Engineer   --   Platforms Team   --  HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341    W: +1 (505) 606-0605
Los Alamos National Laboratory,  P.O. Box 1663,  Los Alamos, NM   87545-0001