Unfortunately I don't know what your issue is, but I'm inclined to think it might be something odd with your reservation. Adding the output of `scontrol show reservation g09_test` to your next mail might help others spot it.

Also, if you haven't already, you might want to raise the debug logging to the maximum (something might be getting lost in the logs) and add -vvv to the srun.
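Roughly what I have in mind, as a sketch (the debug level is just a suggestion, and the srun line simply mirrors the one from your earlier mail):

    # show what the reservation actually contains
    scontrol show reservation g09_test

    # bump controller verbosity and turn on GRES debug output
    scontrol setdebug debug2
    scontrol setdebugflags +Gres

    # on dgx09: set SlurmdDebug=debug2 in slurm.conf and restart slurmd,
    # or run it in the foreground with "slurmd -D -vvv", then retry:
    srun -vvv --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash

`scontrol setdebug info` and `scontrol setdebugflags -Gres` put things back once you've captured a failing step.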
On Wed, Nov 26, 2025 at 4:26 PM Lee via slurm-users <[email protected]> wrote:
>
> Hello,
>
> 1. Output from `scontrol show node=dgx09`
> user@l01:~$ scontrol show node=dgx09
> NodeName=dgx09 Arch=x86_64 CoresPerSocket=56
> CPUAlloc=0 CPUEfctv=224 CPUTot=224 CPULoad=0.98
> AvailableFeatures=location=local
> ActiveFeatures=location=local
> Gres=gpu:h100:8(S:0-1)
> NodeAddr=dgx09 NodeHostName=dgx09 Version=23.02.6
> OS=Linux 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023
> RealMemory=2063937 AllocMem=0 FreeMem=2033902 Sockets=2 Boards=1
> MemSpecLimit=30017
> State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
> Partitions=defq
> BootTime=2025-11-04T13:57:26 SlurmdStartTime=2025-11-05T15:40:46
> LastBusyTime=2025-11-25T13:07:36 ResumeAfterTime=None
> CfgTRES=cpu=224,mem=2063937M,billing=448,gres/gpu=8
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> ReservationName=g09_test
>
>
> 2. I don't see any errors in slurmctld related to dgx09, when I submit a job :
>
> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty
> bash
> srun: error: Unable to create step for job 108596: Invalid generic resource
> (gres) specification
>
> slurmctld shows :
> [2025-11-26T10:57:42.592] sched: _slurm_rpc_allocate_resources JobId=108596
> NodeList=dgx09 usec=1495
> [2025-11-26T10:57:42.695] _job_complete: JobId=108596 WTERMSIG 1
> [2025-11-26T10:57:42.695] _job_complete: JobId=108596 done
>
> 3. Grep'ing for jobid and for errors on dgx09:/var/log/slurmd returns
> nothing, i.e.
> root@dgx09:~# grep -i error /var/log/slurmd  # no output
> root@dgx09:~# grep -i 108596 /var/log/slurmd  # no output
>
> Looking at journalctl :
> root@dgx09:~# journalctl -fu slurmd.service
> Nov 26 10:57:33 dgx09 slurmd[1751949]: slurmd: Resource spec: system cgroup
> memory limit set to 30017 MB
> Nov 26 10:57:34 dgx09 slurmd[1751949]: slurmd:
> gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>
> Best,
> Lee
>
> On Tue, Nov 25, 2025 at 1:24 PM Russell Jones via slurm-users
> <[email protected]> wrote:
>>
>> Can you give the output of "scontrol show node dgx09" ?
>>
>> Are there any errors in your slurmctld.log?
>>
>> Are there any errors in slurmd.log on dgx09 node?
>>
>> On Tue, Nov 25, 2025 at 12:13 PM Lee <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> @Russel - good catch. No, I'm not actually missing the square bracket. It
>>> got lost during the copy/paste. I'll restate it below for clarity :
>>> 2. grep NodeName slurm.conf
>>> root@h01:# grep NodeName slurm.conf
>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32
>>> Feature=location=local
>>> NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8
>>> Feature=location=local
>>>
>>> @Keshav : It still doesn't work
>>> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2
>>> --pty bash
>>> srun: error: Unable to create step for job 107044: Invalid generic resource
>>> (gres) specification
>>>
>>> Best,
>>> Lee
>>>
>>>
>>> On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users
>>> <[email protected]> wrote:
>>>>
>>>> > NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>>>> > CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8
>>>> > Feature=location=local
>>>>
>>>> Just in case, that line shows you are missing a bracket in the node name.
>>>> Are you *actually* missing the bracket?
>>>>
>>>>
>>>> On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Sorry for the delayed response, SC25 interfered with my schedule.
>>>>>
>>>>> Answers :
>>>>> 1. Yes, dgx09 and all the others boot the same software images.
>>>>>
>>>>> 2. dgx09 and the other nodes mount a shared file system where Slurm is
>>>>> installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is
>>>>> the same for every node. I assume the library that is used for
>>>>> autodetection lives there. I also found a shared library
>>>>> /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software
>>>>> image). I checked the md5sum and it is the same on both dgx09 and a
>>>>> non-broken node.
>>>>>
>>>>> 3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
>>>>>
>>>>> 4. The only meaningful difference between `scontrol show node` for dgx09
>>>>> and dgx08 (a working node) is :
>>>>>
>>>>> < Gres=gpu:h100:8(S:0-1)
>>>>> ---
>>>>> > Gres=gpu:H100:8(S:0-1)
>>>>>
>>>>> 5. Yes, we've restarted slurmd and slurmctld several times, the behavior
>>>>> persists. Of note, when I run `scontrol reconfigure`, the phantom
>>>>> allocated GPUs (see AllocTRES in original post) are cleared.
>>>>>
>>>>>
>>>>> Important Update :
>>>>> 1. We recently had another GPU tray replaced and now that DGX is
>>>>> experiencing the same behavior as dgx09. I am more convinced that there
>>>>> is something subtle with how the hardware is being detected by Slurm.
>>>>>
>>>>> Best regards,
>>>>> Lee
>>>>>
>>>>>
>>>>> On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Lee,
>>>>>>
>>>>>> I manage a BCM cluster as well. Does DGX09 have the same disk image and
>>>>>> libraries in place? Could the NVidia NVML library, used to auto-detect
>>>>>> the GPU's, be a diff version and causing the case differences?
>>>>>>
>>>>>> If you compare the output of scontrol show node dgx09 and another DGX
>>>>>> node, do they look the same? Does scontrol show config look different on
>>>>>> DGX09 and other nodes?
>>>>>>
>>>>>> Have you restarted the Slurm controllers (slurmctld) and restarted
>>>>>> slurmd on the compute nodes?
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> --
>>>>>> Mick Timony
>>>>>> Senior DevOps Engineer
>>>>>> LASER, Longwood, & O2 Cluster Admin
>>>>>> Harvard Medical School
>>>>>> --
>>>>>> ________________________________
>>>>>> From: Lee via slurm-users <[email protected]>
>>>>>> Sent: Friday, November 14, 2025 7:17 AM
>>>>>> To: John Hearns <[email protected]>
>>>>>> Cc: [email protected] <[email protected]>
>>>>>> Subject: [slurm-users] Re: Invalid generic resource (gres) specification
>>>>>> after RMA
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thank you for the suggestion.
>>>>>>
>>>>>> I ran lspci on dgx09 and a working DGX and the output was identical.
>>>>>>
>>>>>> nvidia-smi shows all 8 GPUs and looks the same as the output from a
>>>>>> working DGX :
>>>>>>
>>>>>> root@dgx09:~# nvidia-smi
>>>>>> Fri Nov 14 07:11:05 2025
>>>>>> +---------------------------------------------------------------------------------------+
>>>>>> | NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA
>>>>>> Version: 12.2 |
>>>>>> |-----------------------------------------+----------------------+----------------------+
>>>>>> | GPU Name Persistence-M | Bus-Id Disp.A |
>>>>>> Volatile Uncorr. ECC |
>>>>>> | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage |
>>>>>> GPU-Util Compute M. |
>>>>>> | | |
>>>>>> MIG M. |
>>>>>> |=========================================+======================+======================|
>>>>>> | 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 29C P0 69W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 1 NVIDIA H100 80GB HBM3 On | 00000000:43:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 30C P0 71W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 2 NVIDIA H100 80GB HBM3 On | 00000000:52:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 33C P0 71W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 3 NVIDIA H100 80GB HBM3 On | 00000000:61:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 31C P0 73W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 4 NVIDIA H100 80GB HBM3 On | 00000000:9D:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 29C P0 68W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 5 NVIDIA H100 80GB HBM3 On | 00000000:C3:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 28C P0 69W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 6 NVIDIA H100 80GB HBM3 On | 00000000:D1:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 30C P0 70W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>> | 7 NVIDIA H100 80GB HBM3 On | 00000000:DF:00.0 Off |
>>>>>> 0 |
>>>>>> | N/A 32C P0 69W / 700W | 4MiB / 81559MiB |
>>>>>> 0% Default |
>>>>>> | | |
>>>>>> Disabled |
>>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>>
>>>>>> +---------------------------------------------------------------------------------------+
>>>>>> | Processes:
>>>>>> |
>>>>>> | GPU GI CI PID Type Process name
>>>>>> GPU Memory |
>>>>>> | ID ID
>>>>>> Usage |
>>>>>> |=======================================================================================|
>>>>>> | No running processes found
>>>>>> |
>>>>>> +---------------------------------------------------------------------------------------+
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Lee
>>>>>>
>>>>>> On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:
>>>>>>
>>>>>> I work for AMD...
>>>>>> diagnostics I woud run are lspci nvidia-smi
>>>>>>
>>>>>> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Good afternoon,
>>>>>>
>>>>>> I have a cluster that is managed by Base Command Manager (v10) and it
>>>>>> has several Nvidia DGXs. dgx09 is a problem child. The entire node was
>>>>>> RMA'd and now it no longer behaves the same as my other DGXs. I think
>>>>>> the below symptoms are caused by a single underlying issue.
>>>>>>
>>>>>> Symptoms :
>>>>>> 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
>>>>>> grep Gres`, 7/8 DGXs report "Gres=gpu:H100:8(S:0-1)" while dgx09 reports
>>>>>> "Gres=gpu:h100:8(S:0-1)"
>>>>>>
>>>>>> 2. When I submit a job to this node, I get :
>>>>>>
>>>>>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>>>>>> srun: error: Unable to create step for job 105035: Invalid generic
>>>>>> resource (gres) specification
>>>>>>
>>>>>> ### No job is running on the node, yet AllocTRES shows consumed
>>>>>> resources...
>>>>>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>>>>> AllocTRES=gres/gpu=2
>>>>>>
>>>>>> ### dgx09 : /var/log/slurmd contains no information
>>>>>> ### slurmctld shows :
>>>>>> root@h01:# grep 105035 /var/log/slurmctld
>>>>>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources
>>>>>> JobId=105035 NodeList=dgx09 usec=3420
>>>>>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>>>>>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>>>>>>
>>>>>>
>>>>>> Configuration :
>>>>>> 1. gres.conf :
>>>>>> # This section of this file was automatically generated by cmd. Do not
>>>>>> edit manually!
>>>>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>>> AutoDetect=NVML
>>>>>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>>>>>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>>>>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>>>
>>>>>> 2. grep NodeName slurm.conf
>>>>>> root@h01:# grep NodeName slurm.conf
>>>>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>>>>>> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017
>>>>>> Gres=gpu:1g.20gb:32 Feature=location=local
>>>>>> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
>>>>>> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8
>>>>>> Feature=location=local
>>>>>>
>>>>>> 3. What slurmd detects on dgx09
>>>>>>
>>>>>> root@dgx09:~# slurmd -C
>>>>>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56
>>>>>> ThreadsPerCore=2 RealMemory=2063937
>>>>>> UpTime=8-00:39:10
>>>>>>
>>>>>> root@dgx09:~# slurmd -G
>>>>>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s)
>>>>>> detected
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487
>>>>>> File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487
>>>>>> File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487
>>>>>> File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487
>>>>>> File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487
>>>>>> File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487
>>>>>> File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487
>>>>>> File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487
>>>>>> File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1
>>>>>> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>>
>>>>>>
>>>>>> Questions :
>>>>>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes
>>>>>> in terms of configuration and hardware. Why does scontrol report it
>>>>>> having 'h100' with a lower case 'h' unlike the other dgxs which report
>>>>>> with an upper case 'H'?
>>>>>>
>>>>>> 2. Why is dgx09 not accepting GPU jobs and afterwards it artificially
>>>>>> thinks that there are GPUs allocated even though no jobs are on the node?
>>>>>>
>>>>>> 3. Are there additional tests / configurations that I can do to probe
>>>>>> the differences between dgx09 and all my other nodes?
>>>>>>
>>>>>> Best regards,
>>>>>> Lee

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
