Hi Sean,
Sorry for the delay. The problem got solved accidentally by restarting the Slurm services on the head node; a sketch of what I ran is below. Perhaps it was an unfortunate combination of changes, which I had assumed "scontrol reconfigure" would apply properly.
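(A sketch rather than the exact session: the systemd unit names assume a standard packaged install, and I list the slurmd restart on the compute node for completeness, although in my case restarting on the head node alone was enough.)

➜ sudo systemctl restart slurmctld    # on the head node
➜ sudo systemctl restart slurmd       # on nodeGPU01, for good measure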
Anyway, I will follow your advice and try changing to the "cons_tres" plugin; the exact edit I have in mind is sketched below. Will post back with the result.
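(Taken straight from your suggestion; the only assumption on my side is that, as I understand it, a SelectType change needs a full restart of the Slurm daemons rather than just "scontrol reconfigure".)

In /etc/slurm/slurm.conf:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE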
Best and many thanks.

On Mon, Apr 12, 2021 at 6:35 AM Sean Crosby <scro...@unimelb.edu.au> wrote:

> Hi Cristobal,
>
> The weird stuff I see in your job is
>
> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>
> Not sure why ntasks_per_gres is 65534 and node_cnt is 0.
>
> Can you try
>
> srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi
>
> and post the output of slurmctld.log?
>
> I also recommend changing from cons_res to cons_tres for SelectType, e.g.
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>
>> Hi Sean,
>> Tried as suggested, but still getting the same error.
>> This is the node configuration visible to 'scontrol', just in case:
>>
>> ➜ scontrol show node
>> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>>    AvailableFeatures=ht,gpu
>>    ActiveFeatures=ht,gpu
>>    Gres=gpu:A100:8
>>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
>>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>    Partitions=gpu,cpu
>>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>>    CfgTRES=cpu=256,mem=1000G,billing=256
>>    AllocTRES=
>>    CapWatts=n/a
>>    CurrentWatts=0 AveWatts=0
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>    Comment=(null)
>>
>> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>>
>>> Hi Cristobal,
>>>
>>> My hunch is it is due to the default memory/CPU settings.
>>>
>>> Does it work if you do
>>>
>>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>>>
>>>> Hi Community,
>>>> These last two days I have been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in an srun command (or sbatch). It fails with the error:
>>>>
>>>> ➜ srun --gres=gpu:A100:1 nvidia-smi
>>>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>>>
>>>> The log file on the master node (not the compute one) shows:
>>>>
>>>> ➜ tail -f /var/log/slurm/slurmctld.log
>>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>>>> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
>>>> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>>>
>>>> If launched without --gres, it allocates all GPUs by default and nvidia-smi does work; in fact, our CUDA programs work via Slurm as long as --gres is not specified:
>>>>
>>>> ➜ TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>>> Sun Apr 11 01:05:47 2021
>>>> +-----------------------------------------------------------------------------+
>>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0    |
>>>> |-------------------------------+----------------------+----------------------+
>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>> |                               |                      |               MIG M. |
>>>> |===============================+======================+======================|
>>>> |   0  A100-SXM4-40GB      On  | 00000000:07:00.0 Off |                    0 |
>>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>>> |                               |                      |             Disabled |
>>>> ....
>>>> ....
>>>>
>>>> There is only one DGX A100 compute node, with 8 GPUs and 2x 64-core CPUs. The gres.conf file is simply (I also tried the commented lines):
>>>>
>>>> ➜ ~ cat /etc/slurm/gres.conf
>>>> # GRES configuration for native GPUS
>>>> # DGX A100 8x Nvidia A100
>>>> #AutoDetect=nvml
>>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>>
>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>>>
>>>> Some relevant parts of the slurm.conf file:
>>>>
>>>> ➜ cat /etc/slurm/slurm.conf
>>>> ...
>>>> ## GRES
>>>> GresTypes=gpu
>>>> AccountingStorageTRES=gres/gpu
>>>> DebugFlags=CPU_Bind,gres
>>>> ...
>>>> ## Nodes list
>>>> ## Default CPU layout, native GPUs
>>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>>>> ...
>>>> ## Partitions list
>>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>>
>>>> Any ideas where I should check?
>>>> Thanks in advance
>>>> --
>>>> Cristóbal A. Navarro
>>>>
>>>
>>
>> --
>> Cristóbal A. Navarro
>>
>
--
Cristóbal A. Navarro
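P.S. One more thing I may try later, for anyone who finds this thread: letting Slurm detect the GPUs via NVML instead of the static device list, i.e. uncommenting the line already present in our gres.conf (this assumes slurmd was built with NVML support, which I have not verified):

AutoDetect=nvml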