Hi Community,
Just wanted to share that this problem got solved with the help of the pyxis developers: https://github.com/NVIDIA/pyxis/issues/47
The solution was to add ConstrainDevices=yes, which was missing from the cgroup.conf file.
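For anyone else running GPU nodes with pyxis/enroot, this is roughly what the relevant part of our cgroup.conf ends up looking like (a minimal sketch; the lines other than ConstrainDevices are typical settings that may differ on your system):

# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes      # this was the missing line

Note that TaskPlugin=task/cgroup must also be set in slurm.conf for the cgroup constraints to apply (it already was in our case, see the config further below). With that in place and slurmd restarted, the test from case 1) below now shows only the GPU(s) actually allocated to the job instead of all eight:

➜ srun -p gpu --gres=gpu:A100:1 nvidia-smi -L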
On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Hi Sean and Community,
> Some days ago I changed to the cons_tres plugin and also made AutoDetect=nvml work for gres.conf (attached at the end of the email). The node and partition definitions seem to be OK (attached at the end as well).
> I believe the SLURM setup is just a few steps away from being properly set up; currently I have two very basic scenarios that are giving me questions/problems:
>
> *1) Running GPU jobs without containers*
> I was expecting that when doing, for example, "srun -p gpu --gres=gpu:A100:1 nvidia-smi -L", the output would be just 1 GPU. However, that is not the case:
> ➜ TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
> GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
> GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
> GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
> GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
> GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
> GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
> GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
> GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
>
> Still, when opening an interactive session, it really does provide just 1 GPU:
> ➜ TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 --pty bash
> user@nodeGPU01:$ echo $CUDA_VISIBLE_DEVICES
> 2
>
> Moreover, I tried running simultaneous jobs, each one with --gres=gpu:A100:1 and the source code logically choosing GPU ID 0, and indeed different physical GPUs get used, which is great. My only concern here for *1)* is the listing that always displays all of the devices. It could confuse users, making them think they have all those GPUs at their disposal and leading them to wrong decisions. Nevertheless, this issue is not critical compared to the next one.
>
> *2) Running GPU jobs with containers (pyxis + enroot)*
> In this case the list of GPUs does get reduced to the number of devices selected with gres, but there seems to be a problem with how GPU IDs inside the container map to the physical GPUs, which leads to a CUDA runtime error.
>
> Doing nvidia-smi gives:
> ➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L
> GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
>
> As we can see, physical GPU 2 is allocated (we can check with the UUID). From what I understand of the idea behind SLURM, the programmer does not need to know that this is GPU ID 2; they can just develop the program in terms of GPU ID 0, because there is only 1 GPU allocated. That is how it worked in case 1), otherwise one could not know which GPU ID is the available one.
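> (For completeness, a quick way to compare the two cases is to run the same check with and without the container, e.g.:
> ➜ srun -p gpu --gres=gpu:A100:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'
> ➜ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --gres=gpu:A100:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'
> and see whether the variable and the listed devices agree.)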
>
> Now, if I launch a job with --gres=gpu:A100:1, something like a CUDA matrix multiply with some nvml info printed, I get:
> ➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40)) 1
> Driver version: 450.102.04
> NUM GPUS = 1
> Listing devices:
>   GPU0 A100-SXM4-40GB, index=0, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6 -> util = 0%
> Choosing GPU 0
> GPUassert: no CUDA-capable device is detected main.cu 112
> srun: error: nodeGPU01: task 0: Exited with exit code 100
>
> The "index=..." is the GPU index given by nvml.
> Now, if I do --gres=gpu:A100:3, the real first GPU gets allocated and the program works, but that is not the way SLURM should work:
> ➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40)) 1
> Driver version: 450.102.04
> NUM GPUS = 3
> Listing devices:
>   GPU0 A100-SXM4-40GB, index=0, UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95 -> util = 0%
>   GPU1 A100-SXM4-40GB, index=1, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6 -> util = 0%
>   GPU2 A100-SXM4-40GB, index=2, UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20 -> util = 0%
> Choosing GPU 0
> initializing A and B.......done
> matmul shared mem..........done: time: 26.546274 secs
> copying result to host.....done
> verifying result...........done
>
> I find it very strange that when using containers, GPU 0 from inside the job seems to be trying to access the real physical GPU 0 of the machine, and not the GPU 0 provided by SLURM as in case 1), which worked well.
>
> If anyone has advice on where to look for either of the two issues, I would really appreciate it.
> Many thanks in advance, and sorry for this long email.
> -- Cristobal
>
> ---------------------
> CONFIG FILES
>
> *# gres.conf*
> ➜ ~ cat /etc/slurm/gres.conf
> AutoDetect=nvml
>
> *# slurm.conf*
> ....
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageHost=10.10.0.1
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## Nodes list
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01
>
> On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>
>> Hi Sean,
>> Sorry for the delay. The problem got solved accidentally by restarting the slurm services on the head node.
>> Maybe it was an unfortunate combination of changes, for which I was assuming "scontrol reconfigure" would apply them all properly.
>>
>> Anyway, I will follow your advice and try changing to the "cons_tres" plugin.
>> Will post back with the result.
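>> (For the record, "restarting the slurm services" here means something like the following, assuming the standard systemd units:
>> sudo systemctl restart slurmctld    # on the head node
>> sudo systemctl restart slurmd       # on nodeGPU01
>> which evidently applied changes that "scontrol reconfigure" had not.)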
>> Best and many thanks.
>>
>> On Mon, Apr 12, 2021 at 6:35 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>>
>>> Hi Cristobal,
>>>
>>> The weird stuff I see in your job is
>>>
>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>>> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
>>> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>>>
>>> Not sure why ntasks_per_gres is 65534 and node_cnt is 0.
>>>
>>> Can you try
>>>
>>> srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi
>>>
>>> and post the output of slurmctld.log?
>>>
>>> I also recommend changing from cons_res to cons_tres for SelectType, e.g.
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>> On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>> Tried as suggested, but I am still getting the same error.
>>>> This is the node configuration visible to 'scontrol', just in case:
>>>> ➜ scontrol show node
>>>> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>>>>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>>>>    AvailableFeatures=ht,gpu
>>>>    ActiveFeatures=ht,gpu
>>>>    Gres=gpu:A100:8
>>>>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>>>>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
>>>>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>>>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>>    Partitions=gpu,cpu
>>>>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>>>>    CfgTRES=cpu=256,mem=1000G,billing=256
>>>>    AllocTRES=
>>>>    CapWatts=n/a
>>>>    CurrentWatts=0 AveWatts=0
>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>    Comment=(null)
>>>>
>>>> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>>>>
>>>>> Hi Cristobal,
>>>>>
>>>>> My hunch is it is due to the default memory/CPU settings.
>>>>>
>>>>> Does it work if you do
>>>>>
>>>>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>>>>
>>>>> Sean
>>>>> --
>>>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>>>> Research Computing Services | Business Services
>>>>> The University of Melbourne, Victoria 3010 Australia
>>>>>
>>>>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Community,
>>>>>> These last two days I've been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in an srun command (or sbatch).
>>>>>> It fails with the error:
>>>>>> ➜ srun --gres=gpu:A100:1 nvidia-smi
>>>>>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>>>>>
>>>>>> Log file on the master node (not the compute one):
>>>>>> ➜ tail -f /var/log/slurm/slurmctld.log
>>>>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>>>>>> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
>>>>>> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>>>>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>>>>>
>>>>>> If launched without --gres, it allocates all GPUs by default and nvidia-smi does work; in fact, our CUDA programs do work via SLURM if --gres is not specified.
>>>>>> ➜ TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>>>>> Sun Apr 11 01:05:47 2021
>>>>>> +-----------------------------------------------------------------------------+
>>>>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>>>>>> |-------------------------------+----------------------+----------------------+
>>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>>> |                               |                      |               MIG M. |
>>>>>> |===============================+======================+======================|
>>>>>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>>>>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>>>>> |                               |                      |             Disabled |
>>>>>> ....
>>>>>> ....
>>>>>>
>>>>>> There is only one DGX A100 compute node with 8 GPUs and 2x 64-core CPUs, and the gres.conf file is simply the following (I also tried the commented lines):
>>>>>> ➜ ~ cat /etc/slurm/gres.conf
>>>>>> # GRES configuration for native GPUS
>>>>>> # DGX A100 8x Nvidia A100
>>>>>> #AutoDetect=nvml
>>>>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>>>>
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>>>>>
>>>>>> Some relevant parts of the slurm.conf file:
>>>>>> ➜ cat /etc/slurm/slurm.conf
>>>>>> ...
>>>>>> ## GRES
>>>>>> GresTypes=gpu
>>>>>> AccountingStorageTRES=gres/gpu
>>>>>> DebugFlags=CPU_Bind,gres
>>>>>> ...
>>>>>> ## Nodes list
>>>>>> ## Default CPU layout, native GPUs
>>>>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>>>>>> ...
>>>>>> ## Partitions list
>>>>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>>>>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>>>>
>>>>>> Any ideas where I should check?
>>>>>> Thanks in advance.
>>>>>> --
>>>>>> Cristóbal A. Navarro
>>>>
>>>> --
>>>> Cristóbal A. Navarro
>>
>> --
>> Cristóbal A. Navarro
>
> --
> Cristóbal A. Navarro

--
Cristóbal A. Navarro