Hello,

1. Output from `scontrol show node=dgx09` :

user@l01:~$ scontrol show node=dgx09
NodeName=dgx09 Arch=x86_64 CoresPerSocket=56
   CPUAlloc=0 CPUEfctv=224 CPUTot=224 CPULoad=0.98
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:h100:8(S:0-1)
   NodeAddr=dgx09 NodeHostName=dgx09 Version=23.02.6
   OS=Linux 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023
   RealMemory=2063937 AllocMem=0 FreeMem=2033902 Sockets=2 Boards=1
   MemSpecLimit=30017
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-11-04T13:57:26 SlurmdStartTime=2025-11-05T15:40:46
   LastBusyTime=2025-11-25T13:07:36 ResumeAfterTime=None
   CfgTRES=cpu=224,mem=2063937M,billing=448,gres/gpu=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   ReservationName=g09_test

2. I don't see any errors in slurmctld related to dgx09. When I submit a job :

user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 108596: Invalid generic resource (gres) specification

slurmctld shows :
[2025-11-26T10:57:42.592] sched: _slurm_rpc_allocate_resources JobId=108596 NodeList=dgx09 usec=1495
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 WTERMSIG 1
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 done

3. Grepping for the job ID and for errors in dgx09:/var/log/slurmd returns nothing :

root@dgx09:~# grep -i error /var/log/slurmd    # no output
root@dgx09:~# grep -i 108596 /var/log/slurmd   # no output

Looking at journalctl :

root@dgx09:~# journalctl -fu slurmd.service
Nov 26 10:57:33 dgx09 slurmd[1751949]: slurmd: Resource spec: system cgroup memory limit set to 30017 MB
Nov 26 10:57:34 dgx09 slurmd[1751949]: slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
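
Since slurmctld rejects the step without saying which part of the gres spec it dislikes, the next thing I plan to try is turning on GRES debug logging around a test submission. A rough sketch of what I have in mind (assuming the DebugFlags mechanism in 23.02 behaves as documented and that the Gres flag can be toggled on the fly with scontrol setdebugflags, without a restart) :

root@h01:# scontrol setdebugflags +gres      # temporarily add the Gres debug flag on slurmctld
user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
root@h01:# grep -i gres /var/log/slurmctld   # look for a more specific gres validation message
root@h01:# scontrol setdebugflags -gres      # remove the extra logging again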
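
I also notice that our own configs are not internally consistent about the type name: gres.conf spells it Type=h100, slurm.conf declares Gres=gpu:H100:8, and `slurmd -G` on dgx09 reports Type=h100. I don't know whether that comparison is case-sensitive in 23.02, but to rule it out I'm considering making everything lower-case to match what NVML autodetect reports. A sketch of the change I have in mind (assuming the edit has to go through BCM/cmsh if that part of slurm.conf is autogenerated, the way the gres.conf section is) :

# slurm.conf -- only change: H100 -> h100, to match gres.conf and the slurmd -G output
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:h100:8 Feature=location=local

# then push it out and restart the daemons (we have already been restarting them while debugging)
root@h01:# scontrol reconfigure
root@dgxXY:~# systemctl restart slurmd       # on each of dgx[03-10]

And a quick way to see which type name each node is currently advertising :

user@l01:~$ sinfo -N -o "%N %G" | sort -u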

Best,
Lee


On Tue, Nov 25, 2025 at 1:24 PM Russell Jones via slurm-users <[email protected]> wrote:

> Can you give the output of "scontrol show node dgx09" ?
>
> Are there any errors in your slurmctld.log?
>
> Are there any errors in slurmd.log on dgx09 node?
>
> On Tue, Nov 25, 2025 at 12:13 PM Lee <[email protected]> wrote:
>
>> Hello,
>>
>> @Russell - good catch. No, I'm not actually missing the square bracket. It got lost during the copy/paste. I'll restate it below for clarity :
>>
>> 2. grep NodeName slurm.conf
>> root@h01:# grep NodeName slurm.conf
>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>> NodeName=dgx*[*03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>
>> @Keshav : It still doesn't work :
>> user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
>> srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
>>
>> Best,
>> Lee
>>
>>
>> On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <[email protected]> wrote:
>>
>>> > NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>
>>> Just in case, that line shows you are missing a bracket in the node name. Are you *actually* missing the bracket?
>>>
>>>
>>> On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry for the delayed response, SC25 interfered with my schedule.
>>>>
>>>> *Answers* :
>>>> 1. Yes, dgx09 and all the others boot the same software images.
>>>>
>>>> 2. dgx09 and the other nodes mount a shared file system where Slurm is installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked its md5sum and it is the same on both dgx09 and a non-broken node.
>>>>
>>>> 3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
>>>>
>>>> 4. The only meaningful difference between `scontrol show node` for dgx09 and dgx08 (a working node) is :
>>>>
>>>> < Gres=gpu:*h100*:8(S:0-1)
>>>> ---
>>>> > Gres=gpu:*H100*:8(S:0-1)
>>>>
>>>> 5. Yes, we've restarted slurmd and slurmctld several times; the behavior persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in the original post) are cleared.
>>>>
>>>>
>>>> *Important Update :*
>>>> 1. We recently had another GPU tray replaced, and now that DGX is experiencing the same behavior as dgx09. I am more convinced that there is something subtle about how the hardware is being detected by Slurm.
>>>>
>>>> Best regards,
>>>> Lee
>>>>
>>>>
>>>> On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick <[email protected]> wrote:
>>>>
>>>>> Hi Lee,
>>>>>
>>>>> I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the NVidia NVML library, used to auto-detect the GPUs, be a different version and causing the case differences?
>>>>>
>>>>> If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
>>>>>
>>>>> Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
>>>>>
>>>>> Kind regards
>>>>>
>>>>> --
>>>>> Mick Timony
>>>>> Senior DevOps Engineer
>>>>> LASER, Longwood, & O2 Cluster Admin
>>>>> Harvard Medical School
>>>>> --
>>>>> ------------------------------
>>>>> *From:* Lee via slurm-users <[email protected]>
>>>>> *Sent:* Friday, November 14, 2025 7:17 AM
>>>>> *To:* John Hearns <[email protected]>
>>>>> *Cc:* [email protected] <[email protected]>
>>>>> *Subject:* [slurm-users] Re: Invalid generic resource (gres) specification after RMA
>>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for the suggestion.
>>>>>
>>>>> I ran lspci on dgx09 and a working DGX and the output was identical.
>>>>>
>>>>> nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX :
>>>>>
>>>>> root@dgx09:~# nvidia-smi
>>>>> Fri Nov 14 07:11:05 2025
>>>>> +---------------------------------------------------------------------------------------+
>>>>> | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
>>>>> |-----------------------------------------+----------------------+----------------------+
>>>>> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
>>>>> |                                         |                      |               MIG M. |
>>>>> |=========================================+======================+======================|
>>>>> |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
>>>>> | N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
>>>>> | N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
>>>>> | N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
>>>>> | N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
>>>>> | N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
>>>>> | N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
>>>>> | N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>> |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
>>>>> | N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
>>>>> |                                         |                      |             Disabled |
>>>>> +-----------------------------------------+----------------------+----------------------+
>>>>>
>>>>> +---------------------------------------------------------------------------------------+
>>>>> | Processes:                                                                             |
>>>>> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
>>>>> |        ID   ID                                                             Usage      |
>>>>> |=======================================================================================|
>>>>> |  No running processes found                                                           |
>>>>> +---------------------------------------------------------------------------------------+
>>>>>
>>>>> Best regards,
>>>>> Lee
>>>>>
>>>>> On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:
>>>>>
>>>>> I work for AMD... The diagnostics I would run are lspci and nvidia-smi.
>>>>>
>>>>> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <[email protected]> wrote:
>>>>>
>>>>> Good afternoon,
>>>>>
>>>>> I have a cluster that is managed by Base Command Manager (v10) and it has several Nvidia DGXs. dgx09 is a problem child. The entire node was RMA'd, and now it no longer behaves the same as my other DGXs. I think the symptoms below are caused by a single underlying issue.
>>>>>
>>>>> *Symptoms :*
>>>>> 1. When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)".
>>>>>
>>>>> 2. When I submit a job to this node, I get :
>>>>>
>>>>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>>>>> srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
>>>>>
>>>>> ### No job is running on the node, yet AllocTRES shows consumed resources...
>>>>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>>>> *AllocTRES=gres/gpu=2*
>>>>>
>>>>> ### dgx09 : /var/log/slurmd contains no information
>>>>> ### slurmctld shows :
>>>>> root@h01:# grep 105035 /var/log/slurmctld
>>>>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
>>>>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>>>>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>>>>>
>>>>>
>>>>> *Configuration :*
>>>>> 1. gres.conf :
>>>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>> AutoDetect=NVML
>>>>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>>>>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>>>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>>>>
>>>>> 2. grep NodeName slurm.conf
>>>>> root@h01:# grep NodeName slurm.conf
>>>>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>>>>> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>>>>
>>>>> 3. What slurmd detects on dgx09 :
>>>>>
>>>>> root@dgx09:~# slurmd -C
>>>>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
>>>>> UpTime=8-00:39:10
>>>>>
>>>>> root@dgx09:~# slurmd -G
>>>>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>>>>
>>>>>
>>>>> *Questions :*
>>>>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lower-case 'h', unlike the other DGXs, which report an upper-case 'H'?
>>>>>
>>>>> 2. Why is dgx09 not accepting GPU jobs, and why does it afterwards think that GPUs are still allocated even though no jobs are on the node?
>>>>>
>>>>> 3. Are there additional tests / configurations that I can run to probe the differences between dgx09 and all my other nodes?
>>>>>
>>>>> Best regards,
>>>>> Lee

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
