Hi Cristobal,

The weird stuff I see in your job is

[2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
[2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
[2021-04-11T01:12:23.270] ntasks_per_gres:65534

Not sure why ntasks_per_gres is 65534 and node_cnt is 0.

Can you try

srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi

and post the output of slurmctld.log?

I also recommend changing from cons_res to cons_tres for SelectType, e.g.

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
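A couple of notes on applying that change, off the top of my head, so double-check against the slurm.conf man page: as far as I recall a SelectType change only takes effect after restarting the daemons (scontrol reconfigure is not enough), and switching select plugins can drop saved job state, so do it with the queue drained if you can. Assuming your slurmctld/slurmd are managed by systemd (adjust if you start them some other way), roughly:

# edit /etc/slurm/slurm.conf on the master node and keep nodeGPU01's copy in sync,
# then restart the daemons so the new select plugin is loaded
sudo systemctl restart slurmctld     # on the master node
sudo systemctl restart slurmd        # on nodeGPU01

# re-run the failing request while watching the controller log
tail -f /var/log/slurm/slurmctld.log
srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi

Either way, the gres/JobId lines slurmctld.log prints around that allocation attempt are the part I'd like to see.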
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Hi Sean,
> Tried as suggested but still getting the same error.
> This is the node configuration visible to 'scontrol', just in case:
>
> ➜  scontrol show node
> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>    AvailableFeatures=ht,gpu
>    ActiveFeatures=ht,gpu
>    Gres=gpu:A100:8
>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=gpu,cpu
>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>    CfgTRES=cpu=256,mem=1000G,billing=256
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Comment=(null)
>
> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>
>> Hi Cristobal,
>>
>> My hunch is it is due to the default memory/CPU settings.
>>
>> Does it work if you do
>>
>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>
>> Sean
>> --
>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>>
>>> Hi Community,
>>> These last two days I've been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in a srun command (or sbatch). It fails with the error:
>>>
>>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>>
>>> Log file on the master node (not the compute one):
>>>
>>> ➜  tail -f /var/log/slurm/slurmctld.log
>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>>> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
>>> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>>
>>> If launched without --gres, it allocates all GPUs by default and nvidia-smi does work; in fact, our CUDA programs run fine via SLURM as long as --gres is not specified.
>>>
>>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>> Sun Apr 11 01:05:47 2021
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |                               |                      |               MIG M. |
>>> |===============================+======================+======================|
>>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> ....
>>> ....
>>>
>>> There is only one DGX A100 compute node with 8 GPUs and 2x 64-core CPUs, and the gres.conf file is simply (I also tried the commented lines):
>>>
>>> ➜  ~ cat /etc/slurm/gres.conf
>>> # GRES configuration for native GPUS
>>> # DGX A100 8x Nvidia A100
>>> #AutoDetect=nvml
>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>
>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>>
>>> Some relevant parts of the slurm.conf file:
>>>
>>> ➜  cat /etc/slurm/slurm.conf
>>> ...
>>> ## GRES
>>> GresTypes=gpu
>>> AccountingStorageTRES=gres/gpu
>>> DebugFlags=CPU_Bind,gres
>>> ...
>>> ## Nodes list
>>> ## Default CPU layout, native GPUs
>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>>> ...
>>> ## Partitions list
>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>
>>> Any ideas where I should check?
>>> Thanks in advance
>>> --
>>> Cristóbal A. Navarro
>>
>
> --
> Cristóbal A. Navarro