[slurm-users] Re: Invalid generic resource (gres) specification after RMA

Lee via slurm-users Fri, 14 Nov 2025 04:18:00 -0800

Hello,

Thank you for the suggestion.


I ran lspci on dgx09 and a working DGX and the output was identical.
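
For anyone who wants to repeat this comparison, here is a minimal sketch. It assumes ssh access to both nodes from the head node and uses dgx08 purely as a stand-in for a known-good DGX. The helper name same_output is made up for illustration.

```shell
#!/usr/bin/env bash
# same_output FILE_A FILE_B (hypothetical helper):
# exits 0 when the two saved command outputs are identical,
# otherwise prints a unified diff and exits non-zero.
same_output() {
    diff -u "$1" "$2"
}

# Usage sketch (assumes ssh works; dgx08 stands in for a working node):
#   ssh dgx08 lspci > /tmp/lspci.dgx08
#   ssh dgx09 lspci > /tmp/lspci.dgx09
#   same_output /tmp/lspci.dgx08 /tmp/lspci.dgx09 && echo "lspci output identical"
```

The same helper works for any other captured output, such as nvidia-smi.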

nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX:

root@dgx09:~# nvidia-smi
Fri Nov 14 07:11:05 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
| N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
| N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
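
Since the hardware looks the same, one remaining difference worth scripting a check for is the spelling of the GPU type: the gres.conf quoted below uses Type=h100 while the slurm.conf NodeName line uses Gres=gpu:H100:8. The sketch below only reports such case-only differences between the two files. Whether that difference is the actual cause here is not established, since the other DGXs share the same configuration. The function names and the file paths in the usage line are assumptions for illustration.

```shell
#!/usr/bin/env bash
# gres_types FILE (hypothetical helper): print the distinct GPU type strings
# found in a Slurm config file, taken from "Type=<t>" (gres.conf style) or
# "Gres=gpu:<t>:<n>" (slurm.conf style). Comment lines are skipped.
gres_types() {
    grep -v '^[[:space:]]*#' "$1" \
        | grep -oE '(Type=[^[:space:]]+|Gres=gpu:[^:[:space:]]+:[0-9]+)' \
        | sed -E 's/^Type=//; s/^Gres=gpu:([^:]+):[0-9]+$/\1/' \
        | sort -u
}

# lower STRING: lowercase a string in a POSIX-friendly way.
lower() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# case_mismatches GRES_CONF SLURM_CONF (hypothetical helper): print pairs of
# type names that are equal when case is ignored but spelled differently.
case_mismatches() {
    for a in $(gres_types "$1"); do
        for b in $(gres_types "$2"); do
            if [ "$a" != "$b" ] && [ "$(lower "$a")" = "$(lower "$b")" ]; then
                echo "$a $b"
            fi
        done
    done
}

# Usage sketch (paths are assumptions; adjust to where the files really live):
#   case_mismatches /etc/slurm/gres.conf /etc/slurm/slurm.conf
```

An empty result means the two files spell every GPU type the same way.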


Best regards,
Lee

On Fri, Nov 14, 2025 at 3:53 AM John Hearns <[email protected]> wrote:

> I work for AMD...
> The diagnostics I would run are lspci and nvidia-smi.
>
> On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <
> [email protected]> wrote:
>
>> Good afternoon,
>>
>> I have a cluster that is managed by Base Command Manager (v10) and it has
>> several Nvidia DGXs.  dgx09 is a problem child.  The entire node was RMA'd
>> and now it no longer behaves the same as my other DGXs.  I think the below
>> symptoms are caused by a single underlying issue.
>>
>> *Symptoms : *
>> 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
>> grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09
>> reports "Gres=gpu:*h100*:8(S:0-1)"
>>
>> 2. When I submit a job to this node, I get :
>>
>> $ srun --reservation=g09_test --gres=gpu:2 --pty bash
>> srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
>>
>> ### No job is running on the node, yet AllocTRES shows consumed resources...
>> $ scontrol show node=dgx09 | grep -i AllocTRES
>>    *AllocTRES=gres/gpu=2*
>>
>> ### dgx09 : /var/log/slurmd contains no information
>> ### slurmctld shows :
>> root@h01:# grep 105035 /var/log/slurmctld
>> [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
>> [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
>> [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
>>
>>
>> *Configuration : *
>> 1. gres.conf :
>> # This section of this file was automatically generated by cmd. Do not edit manually!
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>> AutoDetect=NVML
>> NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
>> NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
>> # END AUTOGENERATED SECTION   -- DO NOT REMOVE
>>
>> 2. grep NodeName slurm.conf
>> root@h01:# grep NodeName slurm.conf
>> NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
>> NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
>>
>> 3. What slurmd detects on dgx09
>>
>> root@dgx09:~# slurmd -C
>> NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
>> UpTime=8-00:39:10
>>
>> root@dgx09:~# slurmd -G
>> slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>> slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
>>
>>
>> *Questions : *
>> 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes
>> in terms of configuration and hardware.  Why does scontrol report its GPU
>> type as 'h100' with a lower-case 'h', unlike the other DGXs, which report
>> it with an upper-case 'H'?
>>
>> 2. Why does dgx09 reject GPU jobs and afterwards report GPUs as
>> allocated even though no jobs are running on the node?
>>
>> 3. Are there additional tests / configurations that I can do to probe the
>> differences between dgx09 and all my other nodes?
>>
>> Best regards,
>> Lee
>>
>> --
>> slurm-users mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>>
>

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
