Hi Cristobal,

My hunch is that it's due to the default memory/CPU settings. Does it work if you do

  srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
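If that runs, the scheduler is most likely rejecting the implicit CPU/memory request that comes with an unqualified srun, rather than the GRES itself. As a rough sketch (using the node and partition names from your config), you can also check what slurmctld has registered for the node and which defaults apply:

  # CPUs, RealMemory and Gres the controller has recorded for the node
  scontrol show node nodeGPU01

  # Partition limits, plus any DefMemPerCPU/DefMemPerNode defaults
  scontrol show partition gpu
  scontrol show config | grep -Ei 'defmem|selecttype'

  # Compact per-node view of CPUs/memory/GRES as the scheduler sees them
  sinfo -N -o "%N %c %m %G"

A CPU or memory request that can't be satisfied on the node (for example, a default memory request larger than the node's RealMemory) produces exactly this "Requested node configuration is not available" error.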
Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Hi Community,
> These last two days I've been trying to understand what is causing the
> "Unable to allocate resources" error I keep getting when specifying
> --gres=... in an srun command (or sbatch). It fails with this error:
>
> ➜ srun --gres=gpu:A100:1 nvidia-smi
> srun: error: Unable to allocate resources: Requested node configuration is not available
>
> Log file on the master node (not the compute one):
> ➜ tail -f /var/log/slurm/slurmctld.log
> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>
> If launched without --gres, it allocates all GPUs by default and nvidia-smi works;
> in fact, our CUDA programs work via SLURM as long as --gres is not specified.
> ➜ TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
> Sun Apr 11 01:05:47 2021
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
> |                               |                      |             Disabled |
> ....
> ....
>
> There is only one DGX A100 compute node with 8 GPUs and 2x 64-core CPUs,
> and the gres.conf file is simply this (I also tried the commented lines):
> ➜ ~ cat /etc/slurm/gres.conf
> # GRES configuration for native GPUS
> # DGX A100 8x Nvidia A100
> #AutoDetect=nvml
> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>
> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>
> Some relevant parts of the slurm.conf file:
> ➜ cat /etc/slurm/slurm.conf
> ...
> ## GRES
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
> ...
> ## Nodes list
> ## Default CPU layout, native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
> ...
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>
> Any ideas where I should check?
> Thanks in advance
> --
> Cristóbal A. Navarro