Hello,

*@Reed* - Great suggestion.  I do see a variety of different "Board Part
Number" values, but I don't see a correlation between the Board Part Number
and whether a DGX works or not.

*@Russell, @Michael* - The behavior still exists even when the reservation
is removed.  I added the reservation to prevent production user work from
landing on the node while still being able to debug dgx09.  For
completeness, here is the reservation, followed by roughly how it was
created:

$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
   Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=224
   Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
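
For reference, the reservation was created with roughly the following (a
sketch from memory; the exact options may have differed slightly):

$ scontrol create reservation ReservationName=g09_test StartTime=now \
    Duration=365-00:00:00 Nodes=dgx09 Users=user Flags=SPEC_NODES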

*@Christopher* - I am running Slurm version 23.02.6.  Regarding making sure
that the GPU names are the same, I ran `slurmd -G` on dgx[03-09], wrote the
output of each to a file, and then diffed the output from each of
dgx[03-08] against the output from dgx09.  They are all identical (a rough
sketch of the comparison follows the output).  Reposting the `slurmd -G`
output, which is the same on dgx[03-09]:

$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
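
For reference, the comparison itself was roughly the following (a sketch;
password-less ssh to the nodes and the /tmp paths are assumptions):

$ for h in dgx0{3..9}; do ssh "$h" slurmd -G > /tmp/"$h".gres 2>&1; done
$ for h in dgx0{3..8}; do diff /tmp/"$h".gres /tmp/dgx09.gres; done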

*@Christopher* - I tried copying
/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to
dgx09 and restarted slurmd on dgx09 (rough steps sketched below).  When I
then submitted a job requesting a GPU, I got the same error:
srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification
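
Roughly the steps, run on dgx09 (the source node dgx03 is just an example,
and the restart assumes a systemd-managed slurmd):

$ scp dgx03:/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml \
    /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml
$ systemctl restart slurmd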


*@Michael* - Running srun with -vvv:
$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres                : gres:gpu:1
srun: pty                 :
srun: reservation         : g09_test
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=2061374
srun: debug:  propagating RLIMIT_NOFILE=131072
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 44393
srun: debug:  Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug:  requesting job 113415, user 99, nodes 1 including ((null))
srun: debug:  cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6


Best,
Lee


On Wed, Nov 26, 2025 at 7:33 PM Russell Jones via slurm-users <
[email protected]> wrote:

> Yes I agree about the reservation, that was the next thing I was about to
> focus on.....
>
> Please do show your res config.
>
> On Wed, Nov 26, 2025, 3:26 PM Christopher Samuel via slurm-users <
> [email protected]> wrote:
>
>> On 11/13/25 2:16 pm, Lee via slurm-users wrote:
>>
>> > 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
>> > grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09
>> > reports "Gres=gpu:*h100*:8(S:0-1)"
>>
>> Two thoughts:
>>
>> 1) Looking at the 24.11 code when it's using NVML to get the names
>> everything gets lowercased - so I wonder if these new ones are getting
>> correctly discovered by NVML but the older ones are not and so using the
>> uppercase values in your config?
>>
>>         gpu_common_underscorify_tolower(device_name);
>>
>> I would suggest making sure the GPU names are lower-cased everywhere for
>> consistency.
>>
>> 2) From memory (away from work at the moment) slurmd caches hwloc
>> library information in an XML file - you might want to go and find that
>> on an older and newer node and compare those to see if you see the same
>> difference there.  It could be interesting to stop slurmd on an older
>> node, move that XML file out of the way, start slurmd, and see whether
>> that changes how it reports the node.
>>
>> Also I saw you posted "slurmd -G" on the new one, could you post that
>> from an older one too please?
>>
>> Best of luck,
>> Chris
>> --
>> Chris Samuel  :  http://www.csamuel.org/  :  Philadelphia, PA, USA
>>