I work for AMD... The diagnostics I would run first are lspci and nvidia-smi.
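Something like the following on dgx09 and on one of the DGXs that behaves correctly, so you can diff the output (a minimal sketch; which fields you compare is up to you):

# List the NVIDIA devices on the PCI bus (expect the same 8 H100s on both nodes)
lspci -nn | grep -i nvidia

# Product name, VBIOS and driver version per GPU
nvidia-smi --query-gpu=index,name,vbios_version,driver_version --format=csv

# Full inventory with UUIDs, handy for diffing against a healthy node
nvidia-smi -L

If the driver-reported product name on dgx09 differs from the other nodes, even just in case, that would be worth knowing, since the NVML autodetect bases the GPU type on the device name the driver reports.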

On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <
[email protected]> wrote:

> Good afternoon,
>
> I have a cluster that is managed by Base Command Manager (v10) and it has
> several Nvidia DGXs.  dgx09 is a problem child.  The entire node was RMA'd
> and now it no longer behaves the same as my other DGXs.  I think the below
> symptoms are caused by a single underlying issue.
>
> *Symptoms : *
> 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY | grep
> Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "
> Gres=gpu:*h100*:8(S:0-1)"
>
> 2. When I submit a job to this node, I get :
>
> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
> srun: error: Unable to create step for job 105035: Invalid generic
> resource (gres) specification
>
> ### No job is running on the node, yet AllocTRES shows consumed
> resources...
> $ scontrol show node=dgx09 | grep -i AllocTRES
>    *AllocTRES=gres/gpu=2*
>
> ### dgx09 : /var/log/slurmd contains no information
> ### slurmctld shows :
> root@h01:# grep 105035 /var/log/slurmctld
> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources
> JobId=105035 NodeList=dgx09 usec=3420
> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>
>
> *Configuration : *
> 1. gres.conf :
> # This section of this file was automatically generated by cmd. Do not
> edit manually!
> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
> AutoDetect=NVML
> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
> # END AUTOGENERATED SECTION   -- DO NOT REMOVE
>
> 2. grep NodeName slurm.conf
> root@h01:# grep NodeName slurm.conf
> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2
> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32
> Feature=location=local
> NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
> CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8
> Feature=location=local
>
> 3. What slurmd detects on dgx09
>
> root@dgx09:~# slurmd -C
> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56
> ThreadsPerCore=2 RealMemory=2063937
> UpTime=8-00:39:10
>
> root@dgx09:~# slurmd -G
> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s)
> detected
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487
> File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487
> File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487
> File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487
> File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487
> File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487
> File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487
> File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487
> File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1
> Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>
>
> *Questions : *
> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in
> terms of configuration and hardware.  Why does scontrol report it as having
> 'h100' with a lower-case 'h', unlike the other DGXs, which report an
> upper-case 'H'?
>
> 2. Why is dgx09 not accepting GPU jobs, and why does it afterwards report
> GPUs as allocated even though no jobs are running on the node?
>
> 3. Are there additional tests or configuration checks I can run to probe
> the differences between dgx09 and all my other nodes?
>
> Best regards,
> Lee
>