Are you starting slurmd via 'slurmd -Z' on the dynamic node? The next step would be to check the slurmctld log on the master and the slurmd log on the invalid node; those should give more insight into why the node is seen as invalid. If you can attach them, we may be able to spot the issue.
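For example (rough sketch only; the actual log paths depend on the SlurmctldLogFile and SlurmdLogFile settings in your slurm.conf, so adjust as needed):

On the master:

    # grep gpu18c04d858b05 /var/log/slurm/slurmctld.log | tail -n 50

On the node:

    # tail -n 100 /var/log/slurm/slurmd.log
    # slurmd -Z -D -vvv    (run in the foreground with extra verbosity to watch the registration attempt)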
Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On Sep 1, 2023, at 03:12, Jan Andersen <j...@comind.io> wrote:

I am building a cluster exclusively with dynamic nodes, which all boot up over the network from the same system image (Debian 12); so far there is just one physical node, as well as a vm that I have used for the initial tests:

    # sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    all*         up   infinite      1  inval gpu18c04d858b05
    all*         up   infinite      1  down* node080027aea419

When I compare what the master node thinks of gpu18c04d858b05 with what the node itself reports, they seem to agree:

On gpu18c04d858b05:

    root@gpu18c04d858b05:~# slurmd -C
    NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
    UpTime=0-18:04:06

And on the master:

    # scontrol show node gpu18c04d858b05
    NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
       CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:geforce:1
       NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
       OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
       RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
       State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=all
       BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
       LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
       CfgTRES=cpu=16,mem=64240M,billing=16
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

    # scontrol update nodename=gpu18c04d858b05 state=down reason=hang
    # scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this issue?
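Once the logs point to the underlying cause and it is fixed on the node image, a typical re-registration sequence for a dynamic node would look roughly like this (a sketch only; the node name is taken from your output):

On the node:

    # slurmd -Z

On the master:

    # scontrol update nodename=gpu18c04d858b05 state=resume
    # scontrol show node gpu18c04d858b05 | grep State
    # sinfo -N -n gpu18c04d858b05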