[slurm-users] Nodes stay drained no matter what I do

Patrick Goetz Thu, 24 Aug 2023 08:28:51 -0700


Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)

This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I re-used the original slurm.conf (fearing this might cause issues). The hardware is the same. The Master and nodes all use the same slurm.conf, gres.conf, and cgroup.conf files, which are soft linked into /etc/slurm-llnl from an NFS mounted filesystem.


As per the subject, the nodes refuse to revert to idle:

-----------------------------------------------------------
root@hypnotoad:~# sinfo -N -l
Thu Aug 24 10:01:20 2023

NODELIST   NODES PARTITION STATE   CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
dgx-2          1 dgx       drained   80 80:1:1 500000        0      1 (null)   gres/gpu count repor
dgx-3          1 dgx       drained   80 80:1:1 500000        0      1 (null)   gres/gpu count repor
dgx-4          1 dgx       drained   80 80:1:1 500000        0      1 (null)   gres/gpu count

...

titan-3        1 titans*   drained   40 40:1:1 250000        0      1 (null)   gres/gpu count report

...
-----------------------------------------------------------

Neither of these commands has any effect:

  scontrol update NodeName=dgx-[2-6] State=RESUME
  scontrol update state=idle nodename=dgx-[2-6]


When I check the slurmctld log I find this helpful information:

-----------------------------------------------------------
...

[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument

...
-----------------------------------------------------------
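To see at a glance which nodes are affected, the registration errors can be tallied per node. This is only a rough Python sketch: the regular expression is modeled on the log lines quoted above, and the function name `failed_registrations` is made up for illustration.

```python
import re
from collections import Counter

# Matches the registration errors quoted above; the whitespace before
# "node=" is optional so that run-together log text is still recognized.
REG_ERROR = re.compile(r"error: _slurm_rpc_node_registration\s*node=([^:\s]+):")

def failed_registrations(log_text):
    """Count node registration errors per node name in slurmctld log text."""
    counts = Counter()
    for match in REG_ERROR.finditer(log_text):
        counts[match.group(1)] += 1
    return counts

# Three of the log lines from this post, used as sample input.
sample = """\
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
"""

print(sorted(failed_registrations(sample)))  # ['dgx-2', 'dgx-4', 'titan-12']
```

In practice the text would come from the slurmctld log file rather than an inline string.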

Googling, this appears to indicate that there is a resource mismatch between the actual hardware and what is specified in slurm.conf. Note that the existing configuration worked for Slurm 17, but I checked, and it looks fine to me:


Relevant parts of slurm.conf:

-----------------------------------------------------------
  SchedulerType=sched/backfill
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory

  PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED

  PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED

  GresTypes=gpu
  NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
  NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
  NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
-----------------------------------------------------------

All the nodes in the titan partition are identical hardware, as are the nodes in the dgx partition save for dgx-2, which lost a GPU and is no longer under warranty. So, using a couple of representative nodes:


root@dgx-4:~# slurmd -C

NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846


root@titan-8:~# slurmd -C

NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
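The comparison I did by eye can be written down as a small Python sketch. It assumes the relevant rule is that a configured value must not exceed what `slurmd -C` detects; the helper names `parse_node_line` and `over_configured` are made up for illustration. The `slurmd -C` output quoted above contains no GRES fields, so this check says nothing about the GPU counts.

```python
def parse_node_line(line):
    """Turn 'NodeName=x CPUs=80 ...' into a dict of its Key=Value tokens."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

def over_configured(configured, detected, keys=("CPUs", "RealMemory")):
    """List the keys whose configured value exceeds the detected one."""
    conf = parse_node_line(configured)
    det = parse_node_line(detected)
    return [
        f"{k}: configured {conf[k]} > detected {det[k]}"
        for k in keys
        if k in conf and k in det and int(conf[k]) > int(det[k])
    ]

# The slurm.conf line for dgx-[3-6] and the `slurmd -C` output from dgx-4.
configured = "NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80"
detected = ("NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 "
            "CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846")

print(over_configured(configured, detected))  # [] (no CPU or memory excess)
```

For the values in this post the list comes back empty, which matches my reading that CPUs and RealMemory are not over-specified.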

I'm at a loss for how to debug this and am looking for suggestions. Since the resources on these machines are strictly dedicated to Slurm jobs, would it be best to use the output of `slurmd -C` directly for the right hand side of NodeName, reducing the memory a bit for OS overhead? Is there any way to get better debugging output? "Invalid argument" doesn't tell me much.
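To make the `slurmd -C` idea concrete, here is a rough Python sketch of what I have in mind. The function name `node_line_with_headroom` and the 8192 MB reserve are arbitrary placeholders of mine, not recommended values, and the Gres string is copied from the existing slurm.conf because the `slurmd -C` output above does not include it.

```python
def node_line_with_headroom(detected, gres, reserve_mb):
    """Build a NodeName line from `slurmd -C` output, lowering RealMemory."""
    fields = dict(tok.split("=", 1) for tok in detected.split() if "=" in tok)
    fields.pop("UpTime", None)  # drop the uptime field if present; it is not a config key
    fields["RealMemory"] = str(int(fields["RealMemory"]) - reserve_mb)
    fields["Gres"] = gres       # GRES must still be supplied by hand
    return " ".join(f"{k}={v}" for k, v in fields.items())

# `slurmd -C` output from dgx-4 as quoted in this post.
detected = ("NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 "
            "CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846")

# reserve_mb=8192 is a hypothetical amount of OS headroom.
print(node_line_with_headroom(detected, "gpu:tesla-v100:8", reserve_mb=8192))
# NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=507654 Gres=gpu:tesla-v100:8
```

The NodeName would still need to be widened by hand to a range such as dgx-[3-6] for identical nodes.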


Thanks.
