Hi Sean,
Sorry for the delay.
The problem got solved accidentally by restarting the Slurm services on the
head node.
Maybe it was an unfortunate combination of changes, which I had assumed
"scontrol reconfigure" would apply properly.

Anyway, I will follow your advice and try changing to the "cons_tres" plugin.
I will post back with the result.
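
For reference, the change I plan to make is roughly the sketch below (using the
values you recommended, and assuming the daemons run under the usual systemd
units; as far as I understand, a SelectType change needs a full restart of
slurmctld/slurmd rather than just "scontrol reconfigure"):

# /etc/slurm/slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

# restart the daemons instead of relying on "scontrol reconfigure"
sudo systemctl restart slurmctld   # on the head node
sudo systemctl restart slurmd      # on nodeGPU01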
Best and many thanks,

On Mon, Apr 12, 2021 at 6:35 AM Sean Crosby <scro...@unimelb.edu.au> wrote:

> Hi Cristobal,
>
> The weird stuff I see in your job is
>
> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>
> Not sure why ntasks_per_gres is 65534 and node_cnt is 0.
>
> Can you try
>
> srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi
>
> and post the output of slurmctld.log?
>
> I also recommend changing from cons_res to cons_tres for SelectType
>
> e.g.
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <
> cristobal.navarr...@gmail.com> wrote:
>
>> Hi Sean,
>> I tried as suggested but am still getting the same error.
>> This is the node configuration visible to 'scontrol', just in case:
>> ➜  scontrol show node
>> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>>    AvailableFeatures=ht,gpu
>>    ActiveFeatures=ht,gpu
>>    Gres=gpu:A100:8
>>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
>>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>    Partitions=gpu,cpu
>>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>>    CfgTRES=cpu=256,mem=1000G,billing=256
>>    AllocTRES=
>>    CapWatts=n/a
>>    CurrentWatts=0 AveWatts=0
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>    Comment=(null)
>>
>>
>>
>>
>> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scro...@unimelb.edu.au>
>> wrote:
>>
>>> Hi Cristobal,
>>>
>>> My hunch is it is due to the default memory/CPU settings.
>>>
>>> Does it work if you do
>>>
>>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>>
>>> Sean
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <
>>> cristobal.navarr...@gmail.com> wrote:
>>>
>>>> Hi Community,
>>>> For the last two days I've been trying to understand the cause of the
>>>> "Unable to allocate resources" error I keep getting when specifying
>>>> --gres=... in a srun command (or sbatch). It fails with the error:
>>>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>>>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>>>
>>>> Log file on the master node (not the compute one):
>>>> ➜  tail -f /var/log/slurm/slurmctld.log
>>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>>>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>>>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>>>
>>>> If launched without --gres, it allocates all GPUs by default and
>>>> nvidia-smi does work; in fact, our CUDA programs work via SLURM as long
>>>> as --gres is not specified.
>>>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>>> Sun Apr 11 01:05:47 2021
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>>>> |-------------------------------+----------------------+----------------------+
>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>> |                               |                      |               MIG M. |
>>>> |===============================+======================+======================|
>>>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>>> |                               |                      |             Disabled |
>>>> ....
>>>> ....
>>>>
>>>> There is only one DGX A100 compute node with 8 GPUs and 2x 64-core
>>>> CPUs, and the gres.conf file is simply the following (I also tried the
>>>> commented lines):
>>>> ➜  ~ cat /etc/slurm/gres.conf
>>>> # GRES configuration for native GPUS
>>>> # DGX A100 8x Nvidia A100
>>>> #AutoDetect=nvml
>>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>>
>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
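>>>>
>>>> (If per-GPU core binding turns out to be needed, I assume an even split
>>>> of the 128 physical cores over the 8 GPUs would look roughly like the
>>>> sketch below; the real affinity would have to be confirmed with
>>>> "nvidia-smi topo -m", so these ranges are only a guess:)
>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-15
>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=16-31
>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=32-47
>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=48-63
>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=64-79
>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=80-95
>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=96-111
>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=112-127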
>>>>
>>>>
>>>> Some relevant parts of the slurm.conf file
>>>> ➜  cat /etc/slurm/slurm.conf
>>>> ...
>>>> ## GRES
>>>> GresTypes=gpu
>>>> AccountingStorageTRES=gres/gpu
>>>> DebugFlags=CPU_Bind,gres
>>>> ...
>>>> ## Nodes list
>>>> ## Default CPU layout, native GPUs
>>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>>>> ...
>>>> ## Partitions list
>>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>>
>>>> Any ideas on where I should check?
>>>> Thanks in advance.
>>>> --
>>>> Cristóbal A. Navarro
>>>>
>>>
>>
>> --
>> Cristóbal A. Navarro
>>
>

-- 
Cristóbal A. Navarro
