A "Rewound credential" error means that the credential appears to have been encoded more than TTL seconds in the future (the default munge TTL is 5 minutes). In other words, the clock on the decoding host is behind the clock on the encoding host. You can try running munge with a different TTL (munge -t) just to verify whether it is a time sync issue. Also check the time on munge.key.
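As a rough sanity check, the ENCODED and DECODED timestamps from the slurmctld log quoted further down can be compared against the default TTL. This is only a sketch: the helper name `clock_gap_seconds` is made up for illustration, and the "gap greater than TTL" comparison is a simplification of the rule described above, not munge's actual code.

```python
from datetime import datetime

# Timestamps copied from the ENCODED/DECODED lines of the slurmctld log.
ENCODED = "Tue Oct 27 11:09:45 2020"
DECODED = "Tue Oct 27 11:02:07 2020"

# Default munge TTL of 5 minutes, as stated above.
DEFAULT_TTL_SECONDS = 5 * 60

# Format of the timestamps as they appear in the log.
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def clock_gap_seconds(encoded: str, decoded: str) -> int:
    """Return how many seconds the encode time lies ahead of the decode time.

    Hypothetical helper for illustration; not part of munge or Slurm.
    """
    enc = datetime.strptime(encoded, LOG_TIME_FORMAT)
    dec = datetime.strptime(decoded, LOG_TIME_FORMAT)
    return int((enc - dec).total_seconds())


gap = clock_gap_seconds(ENCODED, DECODED)
# Simplified check: an encode time more than TTL seconds ahead looks "rewound".
looks_rewound = gap > DEFAULT_TTL_SECONDS
print(f"encode time is {gap} s ahead of decode time; exceeds TTL: {looks_rewound}")
```

With the values from the log, the gap is 458 seconds (7 min 38 s), which is more than the 300-second default TTL, so the error is consistent with the clocks having been that far apart when the credential was decoded.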
I don't think it's related to the new subnet.

Cheers,
Barbara

On 10/27/20 9:58 PM, Gard Nelson wrote:
>
> Thanks for your help, Prentice.
>
> Sorry, yes – centos 7.5 installed on a fresh HDD. I rebooted and
> checked that chronyd is disabled. ntpd is running. The rest of the
> cluster uses centos 7.5 and ntp so it’s possible, although maybe not
> ideal.
>
> I’m running ntpq on the new compute node. It is looking to the slurm
> head node, which is also set up as the ntp server. Here’s the output:
>
> [root ~]# ntpq -p
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000
>
> It was a bit of a pain to get set up. The time difference was several
> hours so ntp would have taken ages to fix on its own. I have used
> ntpdate successfully on the existing compute nodes, but got a “no
> server suitable for synchronization found” error here. ‘ntpd -gqx’
> timed out. So in order to set the time, I had to point ntp to the
> default centos pool of ntp servers to set the time and then point it
> back to the headnode. After that, ‘ntpd -gqx’ ran smoothly and I
> assume (based on the ntpq output) that it worked. Running ‘date’ on
> the new compute node and the existing head node simultaneously returns
> the same time to within ~1 sec rather than the 7:30 gap from the log
> file.
>
> Not sure if it’s relevant to this problem, but the new compute node is
> on a different subnet connected to a different port than the existing
> compute nodes. This is the first time that I’ve set up a node on a
> different subnet. I figured it would be simple to point slurm to the
> new node, but I didn’t anticipate ntp and munge issues.
>
> Thanks,
> Gard
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Prentice Bisbal <pbis...@pppl.gov>
> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Date: Tuesday, October 27, 2020 at 12:22 PM
> To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
> You don't specify what OS or version you're using. If you're using
> RHEL 7 or a derivative, chrony is used by default over ntpd, so there
> could be some confusion between chronyd and ntpd. If you haven't done
> so already, I'd check to see which daemon is actually running on your
> system.
>
> Can you share the complete output of ntpq -p with us, and let us know
> what nodes the output is from? You might want to run 'ntpdate' before
> starting ntpd. If the clocks are too far off, either ntpd won't
> correct the time, or it will take a long time. ntpdate immediately
> syncs up the time between servers.
>
> I would make sure ntpdate is installed and enabled, then reboot both
> compute nodes. This will make sure that ntpdate is called at startup
> before ntpd, and will then make sure all start using the correct time.
>
> --
> Prentice
>
> On 10/27/20 2:08 PM, Gard Nelson wrote:
>
> > Hi everyone,
> >
> > I’m adding a new node to an existing cluster. After installing
> > slurm and the prereqs, I synced the clocks with ntpd. When I run
> > ‘ntpq -p’, I get 0.0 for delay, offset and jitter. (the slurm head
> > node is also the ntp server) ‘date’ also gives me identical times
> > for the head and compute nodes. However, when I start slurmd, I
> > get a munge error about the clocks being out of sync. From the
> > slurmctld log:
> >
> > [2020-10-27T11:02:06.511] node NEW_NODE returned to service
> > [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential
> > [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
> > [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
> > [2020-10-27T11:02:07.265] error: Check for out of sync clocks
> > [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound credential
> > [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: Protocol authentication error
> > [2020-10-27T11:02:07.275] error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified error
> >
> > I restarted ntp, munge and the slurm daemons on both nodes before
> > this last error was generated. Any idea what’s going on here?
> >
> > Thanks,
> > Gard
> >
> > CONFIDENTIALITY NOTICE
> > This e-mail message and any attachments are only for the
> > use of the intended recipient and may contain
> > information that is privileged, confidential or exempt
> > from disclosure under applicable law. If you are not the
> > intended recipient, any disclosure, distribution or
> > other use of this e-mail message or attachments is
> > prohibited. If you have received this e-mail message in
> > error, please delete and notify the sender immediately.
> > Thank you.
>
> --
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov