Can you give the output of "scontrol show node dgx09"? Are there any errors in your slurmctld.log?
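It would also help to see the raw autodetection output from a working node next to the dgx09 output you already posted, in case the detected Type string differs in more than just case. A rough sketch, assuming dgx08 is still a good comparison node:

# on dgx08, then again on dgx09 for a side-by-side diff
slurmd -G 2>&1 | grep -i 'type='

# on the head node, the controller's current view of both
scontrol show node dgx09 | grep -iE 'gres|alloctres'
scontrol show node dgx08 | grep -iE 'gres|alloctres'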
Are there any errors in slurmd.log on the dgx09 node?

On Tue, Nov 25, 2025 at 12:13 PM Lee <[email protected]> wrote:

> Hello,
>
> @Russell - good catch. No, I'm not actually missing the square bracket.
> It got lost during the copy/paste. I'll restate it below for clarity:
>
> 2. grep NodeName slurm.conf
> root@h01:# grep NodeName slurm.conf
> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
> NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>
> @Keshav: It still doesn't work.
> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
> srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
>
> Best,
> Lee
>
>
> On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <[email protected]> wrote:
>
>> > NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>
>> Just in case, that line shows you are missing a bracket in the node name.
>> Are you *actually* missing the bracket?
>>
>> On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Sorry for the delayed response; SC25 interfered with my schedule.
>>>
>>> *Answers:*
>>> 1. Yes, dgx09 and all the others boot the same software images.
>>>
>>> 2. dgx09 and the other nodes mount a shared file system where Slurm is installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library that is used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked its md5sum and it is the same on both dgx09 and a non-broken node.
>>>
>>> 3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
>>>
>>> 4. The only meaningful difference between `scontrol show node` for dgx09 and dgx08 (a working node) is:
>>>
>>> < Gres=gpu:*h100*:8(S:0-1)
>>> ---
>>> > Gres=gpu:*H100*:8(S:0-1)
>>>
>>> 5. Yes, we've restarted slurmd and slurmctld several times; the behavior persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in the original post) are cleared.
>>>
>>> *Important Update:*
>>> 1. We recently had another GPU tray replaced, and now that DGX is experiencing the same behavior as dgx09. I am more convinced that there is something subtle in how the hardware is being detected by Slurm.
>>>
>>> Best regards,
>>> Lee
>>>
>>>
>>> On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick <[email protected]> wrote:
>>>
>>>> Hi Lee,
>>>>
>>>> I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the Nvidia NVML library, used to auto-detect the GPUs, be a different version and be causing the case difference?
>>>>
>>>> If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
>>>>
>>>> Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
>>>>
>>>> Kind regards
>>>>
>>>> --
>>>> Mick Timony
>>>> Senior DevOps Engineer
>>>> LASER, Longwood, & O2 Cluster Admin
>>>> Harvard Medical School
>>>> --
>>>> ------------------------------
>>>> *From:* Lee via slurm-users <[email protected]>
>>>> *Sent:* Friday, November 14, 2025 7:17 AM
>>>> *To:* John Hearns <[email protected]>
>>>> *Cc:* [email protected] <[email protected]>
>>>> *Subject:* [slurm-users] Re: Invalid generic resource (gres) specification after RMA
>>>>
>>>> Hello,
>>>>
>>>> Thank you for the suggestion.
>>>>
>>>> I ran lspci on dgx09 and a working DGX and the output was identical.
>>>>
>>>> nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX:
>>>>
>>>> root@dgx09:~# nvidia-smi
>>>> Fri Nov 14 07:11:05 2025
>>>> +---------------------------------------------------------------------------------------+
>>>> | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
>>>> |-----------------------------------------+----------------------+----------------------+
>>>> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
>>>> |                                         |                      |               MIG M. |
>>>> |=========================================+======================+======================|
>>>> |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
>>>> | N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
>>>> | N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
>>>> | N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
>>>> | N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
>>>> | N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
>>>> | N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
>>>> | N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>> |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
>>>> | N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>> |                                         |                      |             Disabled |
>>>> +-----------------------------------------+----------------------+----------------------+
>>>>
>>>> +---------------------------------------------------------------------------------------+
>>>> | Processes:                                                                             |
>>>> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
>>>> |        ID   ID                                                             Usage      |
>>>> |=======================================================================================|
>>>> |  No running processes found                                                           |
>>>> +---------------------------------------------------------------------------------------+
>>>>
>>>> Best regards,
>>>> Lee
>>>>
>>>> On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:
>>>>
>>>> I work for AMD...
>>>> The diagnostics I would run are lspci and nvidia-smi.
>>>>
>>>> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <[email protected]> wrote:
>>>>
>>>> Good afternoon,
>>>>
>>>> I have a cluster that is managed by Base Command Manager (v10), and it has several Nvidia DGXs. dgx09 is a problem child. The entire node was RMA'd, and now it no longer behaves the same as my other DGXs. I think the symptoms below are caused by a single underlying issue.
>>>>
>>>> *Symptoms:*
>>>> 1. When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)".
>>>>
>>>> 2. When I submit a job to this node, I get:
>>>>
>>>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>>>> srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
>>>>
>>>> ### No job is running on the node, yet AllocTRES shows consumed resources...
>>>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>>> *AllocTRES=gres/gpu=2*
>>>>
>>>> ### dgx09: /var/log/slurmd contains no information
>>>> ### slurmctld shows:
>>>> root@h01:# grep 105035 /var/log/slurmctld
>>>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
>>>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>>>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>>>>
>>>> *Configuration:*
>>>> 1. gres.conf:
>>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>> AutoDetect=NVML
>>>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>>>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>
>>>> 2. grep NodeName slurm.conf
>>>> root@h01:# grep NodeName slurm.conf
>>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>>>> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>>
>>>> 3. What slurmd detects on dgx09:
>>>>
>>>> root@dgx09:~# slurmd -C
>>>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
>>>> UpTime=8-00:39:10
>>>>
>>>> root@dgx09:~# slurmd -G
>>>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>
>>>> *Questions:*
>>>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lowercase 'h', unlike the other DGXs, which report an uppercase 'H'?
>>>>
>>>> 2. Why does dgx09 reject GPU jobs, and why does it afterwards think that GPUs are allocated even though no jobs are on the node?
>>>>
>>>> 3. Are there additional tests or configuration checks I can run to probe the differences between dgx09 and all my other nodes?
>>>>
>>>> Best regards,
>>>> Lee
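
If the logs on both sides stay quiet, two low-risk experiments come to mind. Both are only guesses prompted by the h100/H100 difference, so treat them as a sketch rather than a known fix. First, raise gres debugging on the controller while you reproduce the failing srun, and/or run slurmd in the foreground on dgx09 to watch it build its GPU table:

scontrol setdebugflags +Gres
# reproduce the failing srun, check slurmctld.log, then revert:
scontrol setdebugflags -Gres

# on dgx09 (assuming slurmd runs under systemd there):
systemctl stop slurmd
slurmd -D -vvv
# watch the gres/gpu lines at startup, then stop it and restart the service

Second, you could make the GPU type case consistent for dgx[03-10] across slurm.conf, gres.conf, and the srun request, in either direction, just to rule case sensitivity in or out. A hypothetical edit (since BCM autogenerates parts of these files, it may need to go through cmsh rather than a direct edit):

NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:h100:8 Feature=location=local

followed by "scontrol reconfigure" and a restart of slurmctld and of slurmd on the affected nodes.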
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
