[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?
Hi, all:

We need to detect certain problems at job-end time, so we wrote a detection script into the Slurm epilog that should drain the node if the check does not pass.

I know that exiting the epilog with a non-zero code makes Slurm drain the node automatically, but in that case the drain reason is always marked as "Epilog error". Our auto-repair program then has trouble determining how to repair the node.

Another option is to call scontrol directly from the epilog to drain the node, but the official documentation (https://slurm.schedmd.com/prolog_epilog.html) says:

"Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). ... Slurm commands in these scripts can potentially lead to performance issues and should not be used."

So what is the best way to drain a node from the epilog with a self-defined reason, or to tell Slurm to attach a more verbose message than the generic "Epilog error" reason?
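For reference, here is roughly what our epilog does today (a trimmed sketch only; the check script path is a placeholder, not our real script):

#!/bin/bash
# Epilog (current approach): run a post-job check and signal failure by
# exiting non-zero. /usr/local/sbin/postjob_check.sh is a placeholder name.
if ! /usr/local/sbin/postjob_check.sh; then
    # A non-zero exit makes Slurm drain the node, but the reason is always
    # the generic "Epilog error", which our auto-repair program cannot use
    # to decide what kind of repair to run.
    exit 1
fi
exit 0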
Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?
We've invoked scontrol in our epilog script for years to close off nodes without any issue. What the docs are really referring to is gratuitous use of those commands. If you keep those commands well circumscribed (i.e. only invoked when you actually have to close a node) and use them only when you have no other workaround, then you should be fine.

-Paul Edmon-
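For the original poster's benefit, a minimal sketch of that pattern -- the check script name is a placeholder, but the scontrol syntax is standard, and SLURMD_NODENAME/SLURM_JOB_ID are set by slurmd in the epilog environment:

#!/bin/bash
# Epilog: drain the node ourselves with a self-defined reason instead of
# relying on the generic "Epilog error" drain.
if ! /usr/local/sbin/postjob_check.sh; then    # placeholder check script
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
        Reason="post-job check failed (job $SLURM_JOB_ID)"
fi
# Exit 0 so slurmd does not additionally drain the node with "Epilog error"
# on top of the reason we just set.
exit 0

The Reason string is then something an auto-repair tool can match on (e.g. via sinfo -R).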
Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?
I've done something similar by having the epilog touch a file, then having the node health check (LBNL NHC) act on that file's presence/contents later to do the heavy lifting. There's a window of time where the reason is "Epilog error" before the health check corrects it, but if that delay is tolerable, this keeps the epilog script fast.

griznog
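For anyone searching the archives later, the flag-file handoff can be as small as this -- a sketch only; the flag path and check name are made up, and in practice NHC (rather than a bare shell script) is what consumes the flag:

#!/bin/bash
# Epilog: record the failure in a flag file and return immediately.
FLAG=/var/run/slurm_postjob_failed           # illustrative path
if ! /usr/local/sbin/postjob_check.sh; then  # illustrative check
    echo "job ${SLURM_JOB_ID}: post-job check failed" > "$FLAG"
fi
exit 0

and on the health-check side, something equivalent to:

# Health check (run later, e.g. by NHC or cron): drain with the recorded
# reason, then clear the flag so the node is not re-drained next cycle.
FLAG=/var/run/slurm_postjob_failed
if [ -s "$FLAG" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$(cat "$FLAG")"
    rm -f "$FLAG"
fi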
Re: [slurm-users] FW: gres/gpu count lower than reported
Hi Jim,

I don't know if it makes a difference, but I only ever use the complete numeric suffix within brackets, as in sjc01enadsapp[01-08].

Otherwise I'd raise the debug level of slurmd to maximum by setting SlurmdDebug=debug5 in slurm.conf, tail the SlurmdLogFile on a GPU node, and then restart slurmd there. This might shed some light on what goes wrong.

Cheers,
Stephan

On 03.05.22 20:51, Jim Kavitsky wrote:

Whoops, sent the first one to an incorrect address; apologies if this shows up as a duplicate. -jimk

From: Jim Kavitsky
Date: Tuesday, May 3, 2022 at 11:46 AM
To: slurm-us...@schedmd.com
Subject: gres/gpu count lower than reported

Hello Fellow Slurm Admins,

I have a new Slurm installation that was working and running basic test jobs until I added GPU support. My worker nodes are now all in the drain state, with the reason "gres/gpu count reported lower than configured (0 < 4)".

This is in spite of the fact that nvidia-smi reports all four A100s as active on each node. I have spent a good chunk of a week googling for the solution and trying variants of the gpu config lines and restarting daemons, without any luck.

The relevant lines from my current config files are below. The head node and all workers have the same gres.conf and slurm.conf files. Can anyone suggest anything else I should be looking at or adding? I'm guessing this is a problem that many have faced, and any guidance would be greatly appreciated.

root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
GresTypes=gpu
NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN

root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]

root@sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"
NODELIST         CPUS(A/I/O/T)  STATE  MEMORY   PARTITION  GRES   REASON
sjc01enadsapp01  0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
(sjc01enadsapp02 through sjc01enadsapp08 show the identical state and reason)

root@sjc01enadsapp07:~# nvidia-smi
(output trimmed) Tue May 3 18:41:34 2022; NVIDIA-SMI 470.103.01, Driver Version 470.103.01, CUDA Version 11.4. GPUs 0-3 are all NVIDIA A100-PCI, Persistence-M On, at bus IDs 17:00.0, 65:00.0, CA:00.0 and E3:00.0, each showing 4MiB / 40536MiB memory in use, 0% utilization, MIG Disabled; the only process listed is /usr/lib/xorg/Xorg.
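For anyone reproducing Stephan's debugging suggestion, the steps on one of the GPU nodes look roughly like this (a sketch; the log path varies by site):

# In slurm.conf on the node, raise slurmd verbosity:
#   SlurmdDebug=debug5

# Find this node's slurmd log location, restart slurmd, and watch the log:
scontrol show config | grep -i SlurmdLogFile
systemctl restart slurmd
tail -f /var/log/slurm/slurmd.log    # substitute the path reported above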
Re: [slurm-users] gres/gpu count lower than reported
I have found that the "reason" field doesn't get updated after you correct the issue. For me, it's only when I move the node back to the idle state that the reason field is reset. So, assuming /dev/nvidia[0-3] is correct (I've never seen otherwise with NVIDIA GPUs), try taking the nodes back to the idle state.

Also, keep an eye on the slurmctld and slurmd logs. They are usually quite helpful in highlighting what the issue is.

David
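In command form, David's suggestion is roughly the following, using the node range from Jim's config:

# After fixing the gres configuration, return the drained nodes to service;
# this also clears the stale reason field.
scontrol update NodeName=sjc01enadsapp0[1-8] State=RESUME

# Verify that no nodes remain drained and no reason lingers.
sinfo -R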
Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?
On Tuesday, 03 May 2022, at 15:46:38 (+0800), taleinterve...@sjtu.edu.cn wrote:

> So what is the best way to drain node from epilog with a self-defined
> reason, or tell slurm to add more verbose message besides "Epilog error"
> reason?

Invoking `scontrol` from a prolog/epilog script simply to alter nodes' state and/or reason fields is totally fine. Many sites (including ours) use LBNL NHC for all or part of their epilogs' post-job "sanity checking" of nodes, and -- knock on renewable bamboo -- there have been no concurrency issues (loops, deadlocks, etc.) reported to either project to date. :-)

If it helps, I had similar concerns about invoking the `squeue` command from an NHC run in order to gather job data. The Man Himself (Moe Jette, original creator of Slurm and co-founder of SchedMD) was kind enough to weigh in on the issue (literally, the Issue: https://github.com/mej/nhc/issues/15), saying in part, "I do not believe that you could create a deadlock situation from NHC (if you did, I would consider that a Slurm bug)." -- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script with all the `s`-commands you can think of ;-) But you can at least be reasonably confident that draining/offlining a node from an epilog script will not cause your cluster to implode!

Michael

--
Michael E. Jennings - [PGPH: he/him/his/Mr] -- hpc.lanl.gov
HPC Systems Engineer -- Platforms Team -- HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341  W: +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001
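To connect this back to the auto-repair question in the original post, one workable pattern (a sketch only; the "autorepair:" reason prefix and the repair actions are made up for illustration) is to write a machine-parsable Reason from the epilog and have the repair tooling dispatch on it:

# Epilog side: drain with a parsable, self-defined reason.
scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
    Reason="autorepair:gpu_check_failed job=$SLURM_JOB_ID"

# Auto-repair side: list drained nodes with reasons and dispatch on the prefix.
sinfo -N -h -t drain -o "%N %E" | while read -r node reason; do
    case "$reason" in
        autorepair:gpu_check_failed*) echo "run GPU repair on $node" ;;
        autorepair:*)                 echo "run generic repair on $node" ;;
    esac
done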