Hi Tina -
Thanks for the confirmation! I will make this adjustment to gres.conf.
On 8/25/23 04:50, Tina Friedrich wrote:
Hi Patrick,
we certainly use that information to set affinity, yes. Our gres.conf
files are node-specific; our config management creates them locally from
'nvidia-smi topo -m'. They look like this:
Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia2 CPUs=24-47
Name=gpu Type=a100 File=/dev/nvidia3 CPUs=24-47
which means that the processor affinity is known, and you can request
GPUs as '--gres=gpu:a100:X'.
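For illustration, a request along these lines (the binary name is just a
placeholder) will then be given CPUs local to the allocated GPUs:
srun --gres=gpu:a100:2 -n 1 -c 12 ./my_gpu_app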
Tina
On 24/08/2023 23:17, Patrick Goetz wrote:
Hi Mick -
Thanks for these suggestions. I read over both release notes, but
didn't find anything helpful.
Note that I didn't include gres.conf in my original post. That would
be this:
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
Everything is working now, but a SchedMD comment alerted me to this
highly useful command:
# nvidia-smi topo -m
Now I'm wondering if I should be expressing CPU affinity explicitly in
the gres.conf file.
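If so, I assume the entries would look something like this, with the CPU
ranges taken from the 'CPU Affinity' column of nvidia-smi topo -m (the
ranges below are made up for illustration, not read from my nodes):
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] CPUs=0-19
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] CPUs=20-39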
On 8/24/23 11:24, Timony, Mick wrote:
Hi Patrick,
You may want to review the release notes for 19.05 and any
intermediate versions:
https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
I'd also check the slurmd.log on the compute nodes. It's usually
in /var/log/slurm/slurmd.log
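For example, something like this should surface the relevant messages
(path per the usual packaging default; adjust if yours differs):
grep -iE 'gres|error' /var/log/slurm/slurmd.log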
I'm not 100% sure your gres.conf is correct. We use one gres.conf for
all our nodes; it looks something like this:
NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
The SchedMD docs example is a little different, as they use a per-node
gres.conf in their example at:
https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
I don't see Name in your gres.conf?
Kind regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
------------------------------------------------------------------------
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
of Patrick Goetz <pgo...@math.utexas.edu>
Sent: Thursday, August 24, 2023 11:27 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Nodes stay drained no matter what I do
Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
re-used the original slurm.conf (fearing this might cause issues). The
hardware is the same. The Master and nodes all use the same slurm.conf,
gres.conf, and cgroup.conf files which are soft linked into
/etc/slurm-llnl from an NFS mounted filesystem.
As per the subject, the nodes refuse to revert to idle:
-----------------------------------------------------------
root@hypnotoad:~# sinfo -N -l
Thu Aug 24 10:01:20 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK
WEIGHT AVAIL_FE REASON
dgx-2 1 dgx drained 80 80:1:1 500000 0
1 (null) gres/gpu count repor
dgx-3 1 dgx drained 80 80:1:1 500000 0
1 (null) gres/gpu count repor
dgx-4 1 dgx drained 80 80:1:1 500000 0
1 (null) gres/gpu count
...
titan-3 1 titans* drained 40 40:1:1 250000 0
1 (null) gres/gpu count report
...
-----------------------------------------------------------
Neither of these commands has any effect:
scontrol update NodeName=dgx-[2-6] State=RESUME
scontrol update state=idle nodename=dgx-[2-6]
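For what it's worth, the untruncated drain reason (sinfo cuts it off
above) should be visible with something like:
scontrol show node dgx-2 | grep -i reason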
When I check the slurmctld log I find this helpful information:
-----------------------------------------------------------
...
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration
node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration
node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
node=titan-12: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
node=titan-11: Invalid argument
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration
node=dgx-6: Invalid argument
...
-----------------------------------------------------------
Googling, this appears to indicate that there is a resource mismatch
between the actual hardware and what is specified in slurm.conf. Note
that the existing configuration worked for Slurm 17, but I checked, and
it looks fine to me:
Relevant parts of slurm.conf:
-----------------------------------------------------------
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
GresTypes=gpu
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
-----------------------------------------------------------
All the nodes in the titan partition are identical hardware, as are the
nodes in the dgx partition save for dgx-2, which lost a GPU and is no
longer under warranty. So, using a couple of representative nodes:
root@dgx-4:~# slurmd -C
NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846
root@titan-8:~# slurmd -C
NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
I'm at a loss for how to debug this and am looking for suggestions. Since
the resources on these machines are strictly dedicated to Slurm jobs,
would it be best to use the output of `slurmd -C` directly for the right
hand side of NodeName, reducing the memory a bit for OS overhead? Is
there any way to get better debugging output? "Invalid argument" doesn't
tell me much.
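For dgx-[3-6], for example, I'm imagining something like this, taking the
slurmd -C line verbatim and knocking the memory down a bit (the 510000
figure is just a guess at OS overhead):
NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=510000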
Thanks.