Are you starting slurmd via 'slurmd -Z' on the dynamic node?
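
If not, starting it by hand on the node looks roughly like this; the --conf 
string is optional and the values here are only illustrative (taken from your 
scontrol output below), so adjust or drop it as needed:

# on gpu18c04d858b05, as root:
slurmd -Z --conf "RealMemory=64240 Gres=gpu:geforce:1"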

The next step would be to check the slurmctld log on the master and the slurmd 
log on the invalid node. Those should provide more insight into why the node 
is being marked invalid. If you can attach them, we might be able to spot the issue.
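
If you are not sure where those logs end up, something like this should locate 
and filter them (the /var/log paths are just common defaults; check what 
scontrol reports on your systems):

# find the configured log file locations
scontrol show config | grep -i logfile
# on the master, look for registration messages from the node
grep -i gpu18c04d858b05 /var/log/slurmctld.log | tail -n 50
# on the node itself
grep -iE 'error|invalid' /var/log/slurmd.log | tail -n 50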

Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On Sep 1, 2023, at 03:12, Jan Andersen <j...@comind.io> wrote:

I am building a cluster exclusively with dynamic nodes, which all boot up over 
the network from the same system image (Debian 12); so far there is just one 
physical node, as well as a VM that I have used for the initial tests:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  inval gpu18c04d858b05
all*         up   infinite      1  down* node080027aea419
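
For context, a dynamic-node setup like this only needs a handful of slurm.conf 
settings on the controller; the snippet below is an illustrative sketch rather 
than my exact file:

# slurm.conf on the master - illustrative excerpt for dynamic nodes
MaxNodeCount=64
TreeWidth=65533
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP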

When I compare what the master node thinks of gpu18c04d858b05 with what the 
node itself reports, they seem to agree:

On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
  CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
  AvailableFeatures=(null)
  ActiveFeatures=(null)
  Gres=gpu:geforce:1
  NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
  OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
  RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
  State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=all
  BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
  LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
  CfgTRES=cpu=16,mem=64240M,billing=16
  AllocTRES=
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
  Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this 
issue?

