Can you give the output of "scontrol show node dgx09"?

Are there any errors in your slurmctld.log?

Are there any errors in slurmd.log on the dgx09 node?
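
For example (using the log paths shown in the original post; adjust if yours differ):

grep -iE 'error|107044' /var/log/slurmctld        # on the head node
grep -i error /var/log/slurmd                     # on dgx09
scontrol show node dgx09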

On Tue, Nov 25, 2025 at 12:13 PM Lee <[email protected]> wrote:

> Hello,
>
> @Russell - good catch.  No, I'm not actually missing the square bracket.
> It got lost during the copy/paste.  I'll restate it below for clarity :
> 2. grep NodeName slurm.conf
> root@h01:# grep NodeName slurm.conf
> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
> NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>
> @Keshav : It still doesn't work :
> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
> srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
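>
> In case it helps narrow things down, this is roughly how I'm comparing the
> configured type string against what the node actually registered (the only
> difference I can find is the capitalization) :
> root@h01:# scontrol show node dgx09 | grep -i gres
> root@h01:# grep -iE 'gres|dgx' slurm.conf gres.conf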
>
> Best,
> Lee
>
>
> On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <
> [email protected]> wrote:
>
>> > NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8
>> Feature=location=local
>>
>> Just in case, that line shows you are missing a bracket in the node name.
>> Are you *actually* missing the bracket?
>>
>>
>> On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> Sorry for the delayed response; SC25 interfered with my schedule.
>>>
>>> *Answers* :
>>> 1. Yes, dgx09 and all the others boot the same software images.
>>>
>>> 2. dgx09 and the other nodes mount a shared file system where Slurm is
>>> installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is
>>> the same for every node.  I assume the library that is used for
>>> autodetection lives there.  I also found a shared library 
>>> /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0
>>> (within the software image).  I checked the md5sum and it is the same on
>>> both dgx09 and a non-broken node.
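>>>
>>> (For reference, the comparison was along the lines of
>>> md5sum /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0
>>> run on both dgx09 and a non-broken node.)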
>>>
>>> 3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
>>>
>>> 4. The only meaningful difference between `scontrol show node` for dgx09
>>> and dgx08 (a working node) is :
>>>
>>> <    Gres=gpu:*h100*:8(S:0-1)
>>> ---
>>> >    Gres=gpu:*H100*:8(S:0-1)
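>>>
>>> (That diff was generated with something like
>>> diff <(scontrol show node dgx09) <(scontrol show node dgx08)
>>> i.e. those two lines are the only ones that differ.)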
>>>
>>> 5. Yes, we've restarted slurmd and slurmctld several times; the behavior
>>> persists.  Of note, when I run `scontrol reconfigure`, the phantom
>>> allocated GPUs (see AllocTRES in the original post) are cleared.
>>>
>>>
>>> *Important Update :*
>>> 1. We recently had a GPU tray replaced in another DGX, and that node is now
>>> exhibiting the same behavior as dgx09.  I am now even more convinced that
>>> there is something subtle about how Slurm detects the replaced hardware.
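>>>
>>> One additional comparison that might be worth making between the RMA'd
>>> nodes and a healthy one, in case the replacement hardware reports a
>>> slightly different identity string to NVML (this is only a guess on my
>>> part) :
>>> nvidia-smi --query-gpu=index,name,pci.bus_id,vbios_version --format=csv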
>>>
>>> Best regards,
>>> Lee
>>>
>>>
>>> On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick <
>>> [email protected]> wrote:
>>>
>>>> Hi Lee,
>>>>
>>>> I manage a BCM cluster as well. Does DGX09 have the same disk image and
>>>> libraries in place? Could the NVIDIA NVML library, which is used to
>>>> auto-detect the GPUs, be a different version and be causing the case
>>>> difference?
>>>>
>>>> If you compare the output of scontrol show node dgx09 and another DGX
>>>> node, do they look the same? Does scontrol show config look different
>>>> on DGX09 and other nodes?
>>>>
>>>> Have you restarted the Slurm controllers (slurmctld) and restarted
>>>> slurmd on the compute nodes?
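>>>>
>>>> For example, assuming systemd-managed services (typical on BCM; adjust to
>>>> your setup):
>>>> systemctl restart slurmctld    # on the head node(s)
>>>> systemctl restart slurmd       # on dgx09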
>>>>
>>>> Kind regards
>>>>
>>>> --
>>>> Mick Timony
>>>> Senior DevOps Engineer
>>>> LASER, Longwood, & O2 Cluster Admin
>>>> Harvard Medical School
>>>> --
>>>> ------------------------------
>>>> *From:* Lee via slurm-users <[email protected]>
>>>> *Sent:* Friday, November 14, 2025 7:17 AM
>>>> *To:* John Hearns <[email protected]>
>>>> *Cc:* [email protected] <[email protected]>
>>>> *Subject:* [slurm-users] Re: Invalid generic resource (gres)
>>>> specification after RMA
>>>>
>>>> Hello,
>>>>
>>>> Thank you for the suggestion.
>>>>
>>>> I ran lspci on dgx09 and a working DGX and the output was identical.
>>>>
>>>> nvidia-smi shows all 8 GPUs and looks the same as the output from a
>>>> working DGX :
>>>>
>>>> root@dgx09:~# nvidia-smi
>>>> Fri Nov 14 07:11:05 2025
>>>>
>>>> +---------------------------------------------------------------------------------------+
>>>> | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
>>>> |-----------------------------------------+----------------------+----------------------+
>>>> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
>>>> |                                         |                      |               MIG M. |
>>>> |=========================================+======================+======================|
>>>> |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
>>>> | N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
>>>> | N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
>>>> | N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
>>>> | N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
>>>> | N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
>>>> | N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
>>>> | N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
>>>> | N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>>
>>>> +---------------------------------------------------------------------------------------+
>>>> | Processes:                                                                             |
>>>> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
>>>> |        ID   ID                                                             Usage      |
>>>> |=======================================================================================|
>>>> |  No running processes found                                                           |
>>>> +---------------------------------------------------------------------------------------+
>>>>
>>>>
>>>> Best regards,
>>>> Lee
>>>>
>>>> On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:
>>>>
>>>> I work for AMD...
>>>> The diagnostics I would run are lspci and nvidia-smi.
>>>>
>>>> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <
>>>> [email protected]> wrote:
>>>>
>>>> Good afternoon,
>>>>
>>>> I have a cluster that is managed by Base Command Manager (v10) and it has
>>>> several NVIDIA DGXs.  dgx09 is a problem child: the entire node was RMA'd,
>>>> and it no longer behaves the same as my other DGXs.  I think the symptoms
>>>> below are caused by a single underlying issue.
>>>>
>>>> *Symptoms : *
>>>> 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
>>>> grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09
>>>> reports "Gres=gpu:*h100*:8(S:0-1)"
>>>>
>>>> 2. When I submit a job to this node, I get :
>>>>
>>>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>>>> srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
>>>>
>>>> ### No job is running on the node, yet AllocTRES shows consumed resources...
>>>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>>>    *AllocTRES=gres/gpu=2*
>>>>
>>>> ### dgx09 : /var/log/slurmd contains no information
>>>> ### slurmctld shows :
>>>> root@h01:# grep 105035 /var/log/slurmctld
>>>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
>>>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>>>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
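>>>>
>>>> (squeue -w dgx09 is a quick way to confirm that nothing is queued or
>>>> running on the node while AllocTRES still shows gres/gpu=2.)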
>>>>
>>>>
>>>> *Configuration : *
>>>> 1. gres.conf :
>>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>> AutoDetect=NVML
>>>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>>>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>>>> # END AUTOGENERATED SECTION   -- DO NOT REMOVE
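>>>>
>>>> (Purely as a debugging sketch, and not something I've applied since cmd
>>>> regenerates this section : an explicit, non-autodetected entry for dgx09
>>>> would look roughly like
>>>> NodeName=dgx09 AutoDetect=off Name=gpu Type=H100 File=/dev/nvidia[0-7]
>>>> which would take the NVML-derived type string out of the picture.)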
>>>>
>>>> 2. grep NodeName slurm.conf
>>>> root@h01:# grep NodeName slurm.conf
>>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>>>> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>>
>>>> 3. What slurmd detects on dgx09
>>>>
>>>> root@dgx09:~# slurmd -C
>>>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
>>>> UpTime=8-00:39:10
>>>>
>>>> root@dgx09:~# slurmd -G
>>>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
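>>>>
>>>> (If it helps, something like
>>>> diff <(ssh dgx08 slurmd -G 2>&1) <(ssh dgx09 slurmd -G 2>&1)
>>>> is an easy way to compare this output against a healthy node.)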
>>>>
>>>>
>>>> *Questions : *
>>>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in
>>>> terms of configuration and hardware.  Why does scontrol report it as having
>>>> 'h100' with a lowercase 'h', unlike the other DGXs, which report an
>>>> uppercase 'H'?
>>>>
>>>> 2. Why is dgx09 not accepting GPU jobs, and why does it afterwards report
>>>> GPUs as allocated even though no jobs are running on the node?
>>>>
>>>> 3. Are there additional tests or configuration checks I can run to probe
>>>> the differences between dgx09 and my other nodes?
>>>>
>>>> Best regards,
>>>> Lee
>>>>
>>>>
>>
>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
