Hello,

1. Output from `scontrol show node=dgx09` :

user@l01:~$ scontrol show node=dgx09
NodeName=dgx09 Arch=x86_64 CoresPerSocket=56
   CPUAlloc=0 CPUEfctv=224 CPUTot=224 CPULoad=0.98
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:h100:8(S:0-1)
   NodeAddr=dgx09 NodeHostName=dgx09 Version=23.02.6
   OS=Linux 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023
   RealMemory=2063937 AllocMem=0 FreeMem=2033902 Sockets=2 Boards=1
   MemSpecLimit=30017
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-11-04T13:57:26 SlurmdStartTime=2025-11-05T15:40:46
   LastBusyTime=2025-11-25T13:07:36 ResumeAfterTime=None
   CfgTRES=cpu=224,mem=2063937M,billing=448,gres/gpu=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   ReservationName=g09_test

2. I don't see any errors in slurmctld related to dgx09. When I submit a job :

user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 108596: Invalid generic resource (gres) specification

slurmctld shows :
[2025-11-26T10:57:42.592] sched: _slurm_rpc_allocate_resources JobId=108596 NodeList=dgx09 usec=1495
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 WTERMSIG 1
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 done

3. Grepping for the job ID and for errors in dgx09:/var/log/slurmd returns nothing :

root@dgx09:~# grep -i error /var/log/slurmd    # no output
root@dgx09:~# grep -i 108596 /var/log/slurmd   # no output

Looking at journalctl :

root@dgx09:~# journalctl -fu slurmd.service
Nov 26 10:57:33 dgx09 slurmd[1751949]: slurmd: Resource spec: system cgroup memory limit set to 30017 MB
Nov 26 10:57:34 dgx09 slurmd[1751949]: slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
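
Since slurmctld rejects the step without saying which part of the gres spec it dislikes, the next thing I plan to try is turning on GRES debug logging around a test submission. A rough sketch of what I have in mind (assuming the DebugFlags mechanism in 23.02 behaves as documented and that the Gres flag can be toggled on the fly with scontrol setdebugflags, without a restart) :

root@h01:# scontrol setdebugflags +gres      # temporarily add the Gres debug flag on slurmctld
user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
root@h01:# grep -i gres /var/log/slurmctld   # look for a more specific gres validation message
root@h01:# scontrol setdebugflags -gres      # remove the extra logging again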
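
I also notice that our own configs are not internally consistent about the type name: gres.conf spells it Type=h100, slurm.conf declares Gres=gpu:H100:8, and `slurmd -G` on dgx09 reports Type=h100. I don't know whether that comparison is case-sensitive in 23.02, but to rule it out I'm considering making everything lower-case to match what NVML autodetect reports. A sketch of the change I have in mind (assuming the edit has to go through BCM/cmsh if that part of slurm.conf is autogenerated, the way the gres.conf section is) :

# slurm.conf -- only change: H100 -> h100, to match gres.conf and the slurmd -G output
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:h100:8 Feature=location=local

# then push it out and restart the daemons (we have already been restarting them while debugging)
root@h01:# scontrol reconfigure
root@dgxXY:~# systemctl restart slurmd       # on each of dgx[03-10]

And a quick way to see which type name each node is currently advertising :

user@l01:~$ sinfo -N -o "%N %G" | sort -u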

Best,
Lee


On Tue, Nov 25, 2025 at 1:24 PM Russell Jones via slurm-users <[email protected]> wrote:

> Can you give the output of "scontrol show node dgx09" ?
>
> Are there any errors in your slurmctld.log?
>
> Are there any errors in slurmd.log on dgx09 node?
>
> On Tue, Nov 25, 2025 at 12:13 PM Lee <[email protected]> wrote:
>
>> Hello,
>>
>> @Russell - good catch. No, I'm not actually missing the square bracket. It got lost during the copy/paste. I'll restate it below for clarity :
>>
>> 2. grep NodeName slurm.conf
>> root@h01:# grep NodeName slurm.conf
>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>> NodeName=dgx*[*03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>
>> @Keshav : It still doesn't work :
>> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
>> srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
>>
>> Best,
>> Lee
>>
>>
>> On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <[email protected]> wrote:
>>
>>> > NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>
>>> Just in case, that line shows you are missing a bracket in the node name. Are you *actually* missing the bracket?
>>>
>>>
>>> On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry for the delayed response, SC25 interfered with my schedule.
>>>>
>>>> *Answers* :
>>>> 1. Yes, dgx09 and all the others boot the same software images.
>>>>
>>>> 2. dgx09 and the other nodes mount a shared file system where Slurm is installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked its md5sum and it is the same on both dgx09 and a non-broken node.
>>>>
>>>> 3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
>>>>
>>>> 4. The only meaningful difference between `scontrol show node` for dgx09 and dgx08 (a working node) is :
>>>>
>>>> < Gres=gpu:*h100*:8(S:0-1)
>>>> ---
>>>> > Gres=gpu:*H100*:8(S:0-1)
>>>>
>>>> 5. Yes, we've restarted slurmd and slurmctld several times; the behavior persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in the original post) are cleared.
>>>>
>>>>
>>>> *Important Update :*
>>>> 1. We recently had another GPU tray replaced, and now that DGX is experiencing the same behavior as dgx09. I am more convinced that there is something subtle about how the hardware is being detected by Slurm.
>>>>
>>>> Best regards,
>>>> Lee
>>>>
>>>>
>>>> On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick <[email protected]> wrote:
>>>>
>>>>> Hi Lee,
>>>>>
>>>>> I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the NVidia NVML library, used to auto-detect the GPUs, be a different version and causing the case differences?
>>>>>
>>>>> If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
>>>>>
>>>>> Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
>>>>>
>>>>> Kind regards
>>>>>
>>>>> --
>>>>> Mick Timony
>>>>> Senior DevOps Engineer
>>>>> LASER, Longwood, & O2 Cluster Admin
>>>>> Harvard Medical School
>>>>> --
>>>>> ------------------------------
>>>>> *From:* Lee via slurm-users <[email protected]>
>>>>> *Sent:* Friday, November 14, 2025 7:17 AM
>>>>> *To:* John Hearns <[email protected]>
>>>>> *Cc:* [email protected] <[email protected]>
>>>>> *Subject:* [slurm-users] Re: Invalid generic resource (gres) specification after RMA
>>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for the suggestion.
>>>>>
>>>>> I ran lspci on dgx09 and a working DGX and the output was identical.
>>>>>
>>>>> nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX :
>>>>>
>>>>> root@dgx09:~# nvidia-smi
>>>>> Fri Nov 14 07:11:05 2025
>>>>> +---------------------------------------------------------------------------------------+
>>>>> | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
>>>>> |-----------------------------------------+----------------------+----------------------+
>>>>> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
>>>>> |                                         |                      |               MIG M. |
>>>>> |=========================================+======================+======================|
>>>>> |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
>>>>> | N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
>>>>> | N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
>>>>> | N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
>>>>> | N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
>>>>> | N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
>>>>> | N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
>>>>> | N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
>>>>> | N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>
>>>>> +---------------------------------------------------------------------------------------+
>>>>> | Processes:                                                                             |
>>>>> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
>>>>> |        ID   ID                                                             Usage      |
>>>>> |=======================================================================================|
>>>>> |  No running processes found                                                           |
>>>>> +---------------------------------------------------------------------------------------+
>>>>>
>>>>> Best regards,
>>>>> Lee
>>>>>
>>>>> On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:
>>>>>
>>>>> I work for AMD... The diagnostics I would run are lspci and nvidia-smi.
>>>>>
>>>>> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <[email protected]> wrote:
>>>>>
>>>>> Good afternoon,
>>>>>
>>>>> I have a cluster that is managed by Base Command Manager (v10) and it has several Nvidia DGXs. dgx09 is a problem child. The entire node was RMA'd, and now it no longer behaves the same as my other DGXs. I think the symptoms below are caused by a single underlying issue.
>>>>>
>>>>> *Symptoms :*
>>>>> 1. When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)".
>>>>>
>>>>> 2. When I submit a job to this node, I get :
>>>>>
>>>>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>>>>> srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
>>>>>
>>>>> ### No job is running on the node, yet AllocTRES shows consumed resources...
>>>>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>>>> *AllocTRES=gres/gpu=2*
>>>>>
>>>>> ### dgx09 : /var/log/slurmd contains no information
>>>>> ### slurmctld shows :
>>>>> root@h01:# grep 105035 /var/log/slurmctld
>>>>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
>>>>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>>>>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>>>>>
>>>>>
>>>>> *Configuration :*
>>>>> 1. gres.conf :
>>>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>> AutoDetect=NVML
>>>>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>>>>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>>>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>>
>>>>> 2. grep NodeName slurm.conf
>>>>> root@h01:# grep NodeName slurm.conf
>>>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>>>>> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>>>
>>>>> 3. What slurmd detects on dgx09 :
>>>>>
>>>>> root@dgx09:~# slurmd -C
>>>>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
>>>>> UpTime=8-00:39:10
>>>>>
>>>>> root@dgx09:~# slurmd -G
>>>>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>
>>>>>
>>>>> *Questions :*
>>>>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lower-case 'h', unlike the other DGXs, which report an upper-case 'H'?
>>>>>
>>>>> 2. Why is dgx09 not accepting GPU jobs, and why does it afterwards think that GPUs are still allocated even though no jobs are on the node?
>>>>>
>>>>> 3. Are there additional tests / configurations that I can run to probe the differences between dgx09 and all my other nodes?
>>>>>
>>>>> Best regards,
>>>>> Lee

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
