Hi Patrick,

You may want to review the release notes for 19.05 and any intermediate versions:

https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES

I'd also check the slurmd log on the compute nodes; it's usually in /var/log/slurm/slurmd.log.

I'm not 100% sure your gres.conf is correct. We use one gres.conf for all of our nodes, and it looks something like this:

NodeName=gpu-[1,2] Name=gpu Type=teslaM40  File=/dev/nvidia[0-3]
NodeName=gpu-[3,6] Name=gpu Type=teslaK80  File=/dev/nvidia[0-7]
NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]

The SchedMD docs' example is a little different, as they use a unique gres.conf per node:

https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5

Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1

I don't see a Name entry in your gres.conf?
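If it helps, a gres.conf matching the node definitions in your slurm.conf might look something like the sketch below. This is only a guess built from your Gres= types and counts; the File= ranges assume the GPUs show up as /dev/nvidia0 upward, and on dgx-2 in particular the seven surviving devices may not be numbered 0-6, so check ls /dev/nvidia* on each node first:

# Hypothetical shared gres.conf (device paths are assumptions, not verified)
NodeName=titan-[3-15] Name=gpu Type=titanv     File=/dev/nvidia[0-7]
NodeName=dgx-2        Name=gpu Type=tesla-v100 File=/dev/nvidia[0-6]
NodeName=dgx-[3-6]    Name=gpu Type=tesla-v100 File=/dev/nvidia[0-7]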
Kind regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Patrick Goetz <pgo...@math.utexas.edu>
Sent: Thursday, August 24, 2023 11:27 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Nodes stay drained no matter what I do

Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)

This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I re-used the original slurm.conf (fearing this might cause issues). The hardware is the same. The Master and nodes all use the same slurm.conf, gres.conf, and cgroup.conf files, which are soft-linked into /etc/slurm-llnl from an NFS-mounted filesystem.

As per the subject, the nodes refuse to revert to idle:

-----------------------------------------------------------
root@hypnotoad:~# sinfo -N -l
Thu Aug 24 10:01:20 2023
NODELIST  NODES PARTITION STATE    CPUS S:C:T   MEMORY  TMP_DISK WEIGHT AVAIL_FE REASON
dgx-2     1     dgx       drained  80   80:1:1  500000  0        1      (null)   gres/gpu count repor
dgx-3     1     dgx       drained  80   80:1:1  500000  0        1      (null)   gres/gpu count repor
dgx-4     1     dgx       drained  80   80:1:1  500000  0        1      (null)   gres/gpu count
...
titan-3   1     titans*   drained  40   40:1:1  250000  0        1      (null)   gres/gpu count report
...
-----------------------------------------------------------

Neither of these commands has any effect:

scontrol update NodeName=dgx-[2-6] State=RESUME
scontrol update state=idle nodename=dgx-[2-6]

When I check the slurmctld log I find this helpful information:

-----------------------------------------------------------
...
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
...
-----------------------------------------------------------

Googling, this appears to indicate that there is a resource mismatch between the actual hardware and what is specified in slurm.conf.
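If it helps anyone spot the mismatch, this is roughly the comparison I have in mind (a sketch only; dgx-4 is just a representative node from the sinfo output above):

# what slurmctld is configured to expect on the node
scontrol show node dgx-4 | grep -i gres

# what GPU device files the node actually has
ssh dgx-4 'ls /dev/nvidia[0-9]*'
ssh dgx-4 'nvidia-smi -L'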
Note that the existing configuration worked for Slurm 17, but I checked, and it looks fine to me. Relevant parts of slurm.conf:

-----------------------------------------------------------
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED

GresTypes=gpu
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
-----------------------------------------------------------

All the nodes in the titan partition are identical hardware, as are the nodes in the dgx partition, save for dgx-2, which lost a GPU and is no longer under warranty. So, using a couple of representative nodes:

root@dgx-4:~# slurmd -C
NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846

root@titan-8:~# slurmd -C
NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811

I'm at a loss for how to debug this and am looking for suggestions. Since the resources on these machines are strictly dedicated to Slurm jobs, would it be best to use the output of `slurmd -C` directly for the right-hand side of NodeName, reducing the memory a bit for OS overhead? Is there any way to get better debugging output? "Invalid argument" doesn't tell me much.

Thanks.
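P.S. To make that last question concrete, the kind of NodeName line I have in mind would be built from the slurmd -C output above, keeping the current Gres and rounding RealMemory down a bit for OS overhead. A sketch (untested):

NodeName=titan-[3-15] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000 Gres=gpu:titanv:8
NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:tesla-v100:8

For more verbose output than "Invalid argument", I assume running slurmd in the foreground on one node (slurmd -D -vvvv) or raising SlurmctldDebug would be the way to go?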