Hi Rafal,

How do you restart the nodes? If you don't use scontrol reboot <node>, Slurm doesn't expect the nodes to reboot, which is why you see that reason in those cases.
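If the nodes have already been restarted some other way, a small helper along the following lines could collect the affected nodes from the output of scontrol show nodes and print the matching resume command for each one. This is only a rough sketch: the function names unexpectedly_rebooted_nodes and resume_commands are made up for illustration, and it assumes the Reason text is exactly the one shown in your output. It only prints the commands, so nothing changes until you run them yourself.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm), sketched for illustration only.

# Read `scontrol show nodes` output on stdin and print the name of every
# node whose record contains "Reason=Node unexpectedly rebooted".
unexpectedly_rebooted_nodes() {
    awk '
        {
            # Remember the most recent NodeName= field seen so far.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^NodeName=/) { node = substr($i, 10) }
            }
        }
        /Reason=Node unexpectedly rebooted/ { print node }
    '
}

# Read node names on stdin and print (not run) one resume command per node.
resume_commands() {
    while read -r node; do
        echo "scontrol update NodeName=$node State=RESUME"
    done
}

# Example, on the manager:
#   scontrol show nodes | unexpectedly_rebooted_nodes | resume_commands
```

For planned restarts, scontrol reboot <node> remains the cleaner route, because Slurm then expects the node to go away and come back.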
Best,
Andreas

On 27.09.2019 at 07:53, Rafał Kędziorski <rafal.kedzior...@gmail.com> wrote:

Hi,

I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:

- 1 Pi 4 as manager
- 4 Pi 4 nodes

This works fine. But after every restart of the nodes I get this state:

    cluster@pi-manager:~ $ sinfo
    PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    devcluster*  up     infinite   4      down   pi-4-node-[1-4]

Then I can call

    sudo scontrol update NodeName=<node_name> State=RESUME

for every node. Sometimes all nodes are idle afterwards, and sometimes some are still down:

    cluster@pi-manager:~ $ sinfo
    PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    devcluster*  up     infinite   2      idle   pi-4-node-[1-2]
    devcluster*  up     infinite   2      down   pi-4-node-[3-4]

Status of all nodes:

    cluster@pi-manager:~ $ scontrol show nodes
    NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
       CPUAlloc=0 CPUTot=4 CPULoad=0.24
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
       OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
       RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=devcluster
       BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
       CfgTRES=cpu=4,mem=1M,billing=4
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
       CPUAlloc=0 CPUTot=4 CPULoad=0.06
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
       OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
       RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=devcluster
       BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
       CfgTRES=cpu=4,mem=1M,billing=4
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
       CPUAlloc=0 CPUTot=4 CPULoad=0.02
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
       OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
       RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
       State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=devcluster
       BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
       CfgTRES=cpu=4,mem=1M,billing=4
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]

    NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
       CPUAlloc=0 CPUTot=4 CPULoad=0.02
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
       OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
       RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
       State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=devcluster
       BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
       CfgTRES=cpu=4,mem=1M,billing=4
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]

    NodeName=pi-manager Arch=armv7l CoresPerSocket=1
       CPUAlloc=0 CPUTot=4 CPULoad=0.00
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
       OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
       RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
       CfgTRES=cpu=4,mem=1M,billing=4
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

For the nodes that are down, the reason is:

    Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]

What is the problem? Note that the nodes in my cluster are not running the whole time.

Regards,
Rafal