Sinfo always returns nodes not responding:

[root@srvgridslurm03 ~]# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03

By tailing the slurmctld log, I can see when a node is recognized:

Node srvgridslurm03 now responding

After turning up the logging level I can see the communication between slurmctld and the nodes, and there appears to be a response:

[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
[2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
[2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
[2020-11-26T12:05:14.335] debug2: Tree head got back 1
[2020-11-26T12:05:14.335] debug2: Tree head got back 2
[2020-11-26T12:05:14.336] debug2: Tree head got back 3
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
[2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03

What I do not understand is the disconnect: slurmctld seems to record the responses, yet it still flags every node as not responding. There are only three nodes right now, as this is a test environment.
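
The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of Slurm: it assumes the log and `sinfo -R` text have the layout shown in this message, and the function names (`last_responses`, `down_since`, `responded_while_down`) are made up for illustration. It lists nodes whose most recent `node_did_resp` log entry is later than the time they were marked "Not responding".

```python
from datetime import datetime

# Sample text copied from this message; in practice it would be read from
# the slurmctld log file and from the output of `sinfo -R`.
LOG = """\
[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
[2020-11-26T12:05:14.336] debug2: Tree head got back 3
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
[2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
"""

SINFO_R = """\
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03
"""


def last_responses(log_text):
    """Map node name -> time of its most recent node_did_resp log line."""
    seen = {}
    for line in log_text.splitlines():
        if "node_did_resp" not in line:
            continue
        stamp = line[line.index("[") + 1:line.index("]")]
        node = line.split()[-1]
        seen[node] = datetime.fromisoformat(stamp)  # later lines overwrite
    return seen


def down_since(sinfo_text, reason="Not responding"):
    """Map node name -> time it was marked with the given reason."""
    marked = {}
    for line in sinfo_text.splitlines():
        # The last three columns hold no spaces; the reason may contain them.
        parts = line.rsplit(None, 3)
        if len(parts) != 4 or parts[0].strip() != reason:
            continue  # skips the header and unrelated reasons
        marked[parts[3]] = datetime.fromisoformat(parts[2])
    return marked


def responded_while_down(log_text, sinfo_text):
    """Nodes that logged a response after being marked not responding."""
    responses = last_responses(log_text)
    marked = down_since(sinfo_text)
    return sorted(
        node for node, when in marked.items()
        if node in responses and responses[node] > when
    )


if __name__ == "__main__":
    for node in responded_while_down(LOG, SINFO_R):
        print(node)
```

With the sample data from this message, all three nodes are listed, which matches the observation that responses are logged while the nodes stay flagged.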

3 CentOS 7 systems:

[root@SRVGRIDSLURM01 ~]# scontrol show node
NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T09:12:58]
   Comment=(null)

NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T08:27:58]
   Comment=(null)

NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
   RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
   CfgTRES=cpu=4,mem=7821M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-11-26T10:00:14]
   Comment=(null)

Any suggestions?

Thanks