Unsubscribe

On Thu, Nov 26, 2020 at 3:40 PM <slurm-users-requ...@lists.schedmd.com> wrote:
> Send slurm-users mailing list submissions to
>         slurm-users@lists.schedmd.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
> or, via email, send a message with subject or body 'help' to
>         slurm-users-requ...@lists.schedmd.com
>
> You can reach the person managing the list at
>         slurm-users-ow...@lists.schedmd.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of slurm-users digest..."
>
>
> Today's Topics:
>
>    1. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
>       between the slurmctld process and the slurmd nodes (Steve Bland)
>    2. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
>       between the slurmctld process and the slurmd nodes (Andy Riebs)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 26 Nov 2020 18:01:25 +0000
> From: Steve Bland <sbl...@rossvideo.com>
> To: "a...@candooz.com" <a...@candooz.com>, Slurm User Community List
>         <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
>         connectivity issue between the slurmctld process and the slurmd nodes
> Message-ID:
>         <ytxpr0101mb2302a3f22023838fb5745ea2cf...@ytxpr0101mb2302.canprd01.prod.outlook.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks Andy
>
> The firewall is off on all three systems. Also, if they could not
> communicate, I do not think 'scontrol show node' would return the data
> that it does, and the logs would not show responses as indicated below.
>
> And the names are correct; I used the recommended 'hostname -s' when
> configuring the slurm.conf node entries.
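As an aside on verifying those entries: the thread observes that slurm's matching of NodeName against the node's short hostname seems to be case sensitive. A minimal sketch of the check, with values hard-coded from this thread (on a real node you would use `reported=$(hostname -s)` and pull `configured` from slurm.conf):

```shell
# Compare a slurm.conf NodeName against what `hostname -s` reports.
# Both values below are hard-coded examples taken from this thread.
configured="SRVGRIDSLURM01"   # NodeName= entry in slurm.conf
reported="srvgridslurm01"     # short hostname reported by the node

if [ "$configured" = "$reported" ]; then
    echo "exact match"
else
    echo "mismatch (note the case): $configured vs $reported"
fi
```

A shell string comparison is case sensitive, so this catches exactly the kind of upper/lower-case discrepancy visible in the sinfo output later in the thread.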
> In fact slurm seems to be case sensitive, which surprised the heck out of me.
>
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Andy Riebs
> Sent: Thursday, November 26, 2020 12:50
> To: slurm-users@lists.schedmd.com
> Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity
> issue between the slurmctld process and the slurmd nodes
>
> 1. Look for a firewall on all of your slurm nodes -- they almost always
>    break slurm communications.
> 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
>    "srvgridslurm01"
>
> Andy
>
> On 11/26/2020 12:21 PM, Steve Bland wrote:
>
> Sinfo always returns nodes not responding:
>
> [root@srvgridslurm03 ~]# sinfo -R
> REASON               USER   TIMESTAMP            NODELIST
> Not responding       slurm  2020-11-26T09:12:58  SRVGRIDSLURM01
> Not responding       slurm  2020-11-26T08:27:58  SRVGRIDSLURM02
> Not responding       slurm  2020-11-26T10:00:14  srvgridslurm03
>
> By tailing the log for slurmctld, I can see when a node is recognized:
>
> Node srvgridslurm03 now responding
>
> By turning up the logging levels I can see communication between slurmctld
> and the nodes, and there appears to be a response:
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
> [2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
> [2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
> [2020-11-26T12:05:14.335] debug2: Tree head got back 1
> [2020-11-26T12:05:14.335] debug2: Tree head got back 2
> [2020-11-26T12:05:14.336] debug2: Tree head got back 3
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
> [2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
>
> What I do not understand is the disconnect: slurmctld seems to record
> responses, but flags the nodes as not responding - all of them. There are
> only three right now, as this is a test environment.
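One way to rule plain TCP transport in or out of a disconnect like the one described above is to probe the Slurm ports in both directions. The sketch below is not from the thread: it assumes the default ports (SlurmctldPort=6817, SlurmdPort=6818) and uses bash's /dev/tcp redirection, so it needs no extra tools beyond coreutils' `timeout`:

```shell
# Probe a TCP port; Slurm defaults are 6817 (slurmctld) and 6818 (slurmd).
check_port() {
    local host=$1 port=$2
    # /dev/tcp/<host>/<port> is a bash feature: the redirect fails if the
    # connection cannot be established within the timeout.
    if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
    fi
}

# From the controller, probe slurmd on each compute node
# (node names taken from this thread):
for host in srvgridslurm01 srvgridslurm02 srvgridslurm03; do
    check_port "$host" 6818
done
```

Run the mirror-image check (port 6817 back to the controller) from each node as well. If both directions connect cleanly, the munge authentication setup is the usual next thing to verify.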
> Three CentOS 7 systems.
>
> [root@SRVGRIDSLURM01 ~]# scontrol show node
> NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T09:12:58]
>    Comment=(null)
>
> NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T08:27:58]
>    Comment=(null)
>
> NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T10:00:14]
>    Comment=(null)
>
> Any suggestions? Thanks
>
> ----------------------------------------------
> This e-mail and any attachments may contain information that is
> confidential to Ross Video.
> If you are not the intended recipient, please notify me immediately by
> replying to this message. Please also delete all copies. Thank you.
> ----------------------------------------------
>
> ------------------------------
>
> Message: 2
> Date: Thu, 26 Nov 2020 13:40:24 -0500
> From: Andy Riebs <a...@candooz.com>
> To: Steve Bland <sbl...@rossvideo.com>, Slurm User Community List
>         <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
>         connectivity issue between the slurmctld process and the slurmd nodes
> Message-ID: <cdd891a8-bcff-8cc7-6b40-5854a8095...@candooz.com>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> One last shot on the firewall front, Steve -- does the control node have
> a firewall enabled? I've seen cases where that can cause the sporadic
> messaging failures that you seem to be seeing.
>
> Failing that, I'll defer to anyone with different ideas!
>
> Andy
>
> On 11/26/2020 1:01 PM, Steve Bland wrote:
> >
> > Thanks Andy
> >
> > The firewall is off on all three systems.
> > [...]
>
>
> End of slurm-users Digest, Vol 37, Issue 46
> *******************************************

--
Veronica Chaul
+5411 3581-4041
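A postscript to the digest above: even once the underlying connectivity problem is fixed, nodes marked DOWN with Reason=Not responding typically stay down until resumed by hand (depending on the ReturnToService setting in slurm.conf). A dry-run sketch of that last step follows; `scontrol update ... State=RESUME` is the standard Slurm command, the node names are the ones from this thread, and the leading `echo` keeps the sketch from executing anything:

```shell
# Dry run: print the scontrol command that would clear the DOWN state.
# Remove the `echo` and run on the controller to actually apply it.
resume_cmd() {
    echo scontrol update NodeName="$1" State=RESUME
}

resume_cmd "SRVGRIDSLURM01,SRVGRIDSLURM02,srvgridslurm03"
```

After resuming, `sinfo -R` should report an empty reason list if the nodes are truly reachable again.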