Dear Mike,

Thank you so much for your response. I was so focused on time synchronization that I missed that the date on one of the nodes was a day behind, just as you said. I have corrected it; for reference, the checks I used to verify the clocks are sketched below.
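(A minimal sketch of the clock checks, assuming passwordless SSH from the controller to each compute node and that chronyd is the NTP client, which is the CentOS 7 default. The IP addresses are the NodeAddr values from the scontrol output further down.)

    # compare each compute node's clock against the controller
    for h in 192.168.60.101 192.168.60.114 192.168.60.115; do
        echo "== $h =="; ssh "$h" date
    done

    # on any node that is off, check NTP synchronization
    timedatectl status    # look for "NTP synchronized: yes"
    chronyc tracking      # shows the current offset from the time source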
Now I get the following output in status:

*(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
 Main PID: 19475 (slurmctld)
    Tasks: 10
   Memory: 4.5M
   CGroup: /system.slice/slurmctld.service
           ├─19475 /usr/sbin/slurmctld -D -s
           └─19538 slurmctld: slurmscriptd

Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done

I have four nodes in total, one of which is the server node. After submitting a job, the job runs only on my server's compute node while all the other nodes are IDLE, DOWN, or not responding. The details are given below:

*(base) [nousheen@nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
   LastBusyTime=2022-12-02T00:58:31
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.114 NodeHostName=104
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2022-12-01T21:37:35
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2022-12-01T16:22:28]

NodeName=105 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=1.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
   LastBusyTime=2022-12-01T21:47:11
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
   CPUAlloc=8 CPUTot=12 CPULoad=6.73
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
   LastBusyTime=2022-12-01T21:37:39
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=cpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Whereas this command shows only the one node on which the job is running:

*(base) [nousheen@nousheen slurm]$ squeue -j*
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen

Can you please guide me as to why my compute nodes are down and not working? The checks I am planning to run on the unresponsive node are sketched below.
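(A minimal troubleshooting sketch for the DOWN+NOT_RESPONDING node; it assumes slurmd runs as a systemd service on the compute nodes, and the node name and address are the ones from the scontrol output above.)

    # from the controller: see why Slurm marked the node down
    scontrol show node 104 | grep -i Reason

    # on the compute node: is slurmd actually running and reachable?
    ssh 192.168.60.114 systemctl status slurmd -l

    # print the hardware slurmd detects, to compare against slurm.conf
    ssh 192.168.60.114 slurmd -C

    # once slurmd is up and reachable again, clear the DOWN state
    scontrol update NodeName=104 State=RESUME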
Thank you for your time.

Best Regards,
Nousheen Parvaiz

On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobb...@mines.edu> wrote:

> I believe that the error you need to pay attention to for this issue is this line:
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
> It looks like your compute node's clock is a full day ahead of your controller node: Dec. 2 instead of Dec. 1. The clocks need to be in sync for munge to work.
>
> *Mike Robbert*
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing*
> Information and Technology Solutions (ITS)
> 303-273-3786 | mrobb...@mines.edu
>
> *Our values:* Trust | Integrity | Respect | Responsibility
>
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Nousheen <nousheenparv...@gmail.com>
> *Date:* Thursday, December 1, 2022 at 06:19
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* [External] [slurm-users] ERROR: slurmctld: auth/munge: _print_cred: DECODED
>
> Hello Everyone,
>
> I am using Slurm version 21.08.5 and CentOS 7.
>
> I successfully start slurmd on all compute nodes, but when I start slurmctld on the server node it gives the following error:
>
> *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
>  Main PID: 1631 (slurmctld)
>     Tasks: 10
>    Memory: 4.0M
>    CGroup: /system.slice/slurmctld.service
>            ├─1631 /usr/sbin/slurmctld -D -s
>            └─1818 slurmctld: slurmscriptd
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
> When I run the following command on a compute node I get the following output:
>
> [gpu101@101 ~]$ *munge -n | unmunge*
> STATUS:           Success (0)
> ENCODE_HOST:      ??? (0.0.0.101)
> ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> TTL:              300
> CIPHER:           aes128 (4)
> MAC:              sha1 (3)
> ZIP:              none (0)
> UID:              gpu101 (1000)
> GID:              gpu101 (1000)
> LENGTH:           0
>
> Is this happening because the ENCODE_HOST name shows question marks and the IP is not picked up correctly by munge? How can I correct this? All the nodes become non-responding when I run a job, even though I have all the clocks synced across the cluster.
>
> I am new to Slurm. Kindly guide me in this matter.
>
> Best Regards,
> Nousheen Parvaiz
> Ph.D. Scholar
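P.S. Regarding the ENCODE_HOST: ??? (0.0.0.101) line in the quoted output above: I suspect the purely numeric hostname "101" is being parsed as the literal address 0.0.0.101 rather than being looked up. A minimal sketch of the cross-node munge test I am planning to run (the IP is from this thread):

    # encode a credential here and decode it on the compute node;
    # a "Success" status means the munge keys and clocks agree
    munge -n | ssh 192.168.60.101 unmunge

And, if the numeric name turns out to be the problem, /etc/hosts entries along these lines on every node (the nodeNNN aliases are hypothetical names I chose, not anything configured yet):

    192.168.60.149  nousheen
    192.168.60.101  node101
    192.168.60.114  node104
    192.168.60.115  node105

Using such aliases would also mean updating the NodeName/NodeHostName entries in slurm.conf to match.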