I am building a cluster exclusively with dynamic nodes, which all boot over the network from the same system image (Debian 12). So far there is just one physical node, plus a VM that I used for the initial tests; the dynamic-node bits of slurm.conf are sketched right after the sinfo output:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  inval gpu18c04d858b05
all*         up   infinite      1  down* node080027aea419

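For reference, the dynamic-node side of the setup is essentially the following. This is a sketch rather than a verbatim copy of my slurm.conf; the controller hostname and the MaxNodeCount value are placeholders:

# slurm.conf (excerpt) -- dynamic-node related settings
SlurmctldHost=master
# Upper bound on dynamically registered nodes; there are no static NodeName lines
MaxNodeCount=64
# A high TreeWidth is recommended for dynamic nodes (effectively no fanout tree)
TreeWidth=65533
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP
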
When I compare what the master node thinks of gpu18c04d858b05 with what the node itself reports, they seem to agree:

On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06
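
For completeness: the node registers itself dynamically (hence the DYNAMIC_NORM flag further down), i.e. slurmd is started with -Z. The invocation is roughly of this form; the --conf string here is illustrative, and the GRES could just as well come from gres.conf:

root@gpu18c04d858b05:~# slurmd -Z --conf "Gres=gpu:geforce:1"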

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:geforce:1
   NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
   OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
   RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
   State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all
   BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
   LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
   CfgTRES=cpu=16,mem=64240M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this issue?
