I am building a cluster exclusively with dynamic nodes, which all boot over the network from the same system image (Debian 12). So far there is just one physical node, plus a VM that I used for the initial tests; the dynamic-node bits of slurm.conf are sketched right after the sinfo output:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  inval gpu18c04d858b05
all*         up   infinite      1  down* node080027aea419

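For reference, the dynamic-node side of the setup is essentially the following. This is a sketch rather than a verbatim copy of my slurm.conf; the controller hostname and the MaxNodeCount value are placeholders:

# slurm.conf (excerpt) -- dynamic-node related settings
SlurmctldHost=master
# Upper bound on dynamically registered nodes; there are no static NodeName lines
MaxNodeCount=64
# A high TreeWidth is recommended for dynamic nodes (effectively no fanout tree)
TreeWidth=65533
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP
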
When I compare what the master node thinks of gpu18c04d858b05 with what the node itself reports, they seem to agree:

On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06
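
For completeness: the node registers itself dynamically (hence the DYNAMIC_NORM flag further down), i.e. slurmd is started with -Z. The invocation is roughly of this form; the --conf string here is illustrative, and the GRES could just as well come from gres.conf:

root@gpu18c04d858b05:~# slurmd -Z --conf "Gres=gpu:geforce:1"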

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:geforce:1
   NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
   OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
   RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
   State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all
   BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
   LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
   CfgTRES=cpu=16,mem=64240M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this issue?
