Re: [slurm-users] [External] Munge thinks clocks aren't synced

Williams, Gareth (IM&T, Black Mountain) Wed, 28 Oct 2020 00:07:26 -0700

I’m pretty sure that ntpq output indicates ntp is not working: reach=0 means 
none of the last eight polls of the server got a reply.

https://www.linuxjournal.com/article/6812
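
For anyone reading along, here is a minimal sketch of how the reach and st columns of an `ntpq -p` row can be read. The function name `peer_status` and the "usable" rule are assumptions made for illustration, not part of ntp. The row is the one posted further down this thread.

```python
def peer_status(row):
    """Summarise one peer row from `ntpq -p` output."""
    # Columns: remote refid st t when poll reach delay offset jitter
    fields = row.split()
    stratum = int(fields[2])
    # ntpq prints reach in octal; it is an 8-bit shift register with one
    # bit per recent poll (1 = a reply arrived).
    reach = int(fields[6], 8)
    polls_answered = bin(reach).count("1")
    return {
        "remote": fields[0],  # any leading tally character is left attached
        "stratum": stratum,
        "reach": reach,
        "polls_answered": polls_answered,
        # Heuristic, an assumption of this sketch: call the peer usable only
        # if at least one recent poll was answered and the stratum is below
        # 16 (16 means unsynchronised).
        "usable": reach != 0 and stratum < 16,
    }


row = "HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000"
print(peer_status(row))
```

A healthy peer would typically show reach 377 (all eight polls answered) and a stratum well below 16.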

Gareth

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Barbara 
Krašovec
Sent: Wednesday, 28 October 2020 5:41 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [External] Munge thinks clocks aren't synced

A "Rewound credential" error means that the credential appears to have been 
encoded more than TTL seconds in the future (the default munge TTL is 5 
minutes). So the clock on the decoding host is slow relative to the encoding 
host. You can try running munge with a different TTL (munge -t) just to verify 
whether it is a time sync issue. Also check the timestamp on munge.key.
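
To make the TTL rule concrete, here is a rough sketch of the check as described above. It is a simplified model, not munge's actual code, and the function name `classify_credential` is made up for illustration. The two timestamps are the ENCODED and DECODED values from the slurmctld log quoted further down the thread.

```python
from datetime import datetime, timedelta

DEFAULT_TTL = timedelta(minutes=5)  # munge's default TTL


def classify_credential(encoded_at, decoded_at, ttl=DEFAULT_TTL):
    """Simplified model of the time checks, not munge's actual code."""
    # Encoded further in the future than the TTL allows: decoder clock is slow.
    if encoded_at > decoded_at + ttl:
        return "rewound"
    # Encoded longer ago than the TTL allows: credential is too old.
    if encoded_at + ttl < decoded_at:
        return "expired"
    return "ok"


encoded = datetime(2020, 10, 27, 11, 9, 45)  # ENCODED line in the log
decoded = datetime(2020, 10, 27, 11, 2, 7)   # DECODED line in the log

print(encoded - decoded)                      # 0:07:38
print(classify_credential(encoded, decoded))  # rewound
print(classify_credential(encoded, decoded, ttl=timedelta(minutes=10)))  # ok
```

The 7:38 gap exceeds the 5-minute default, which matches the error in the log. Under this model a 10-minute TTL would accept the same credential, which is why a larger TTL is a quick way to confirm a clock problem.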

I don't think it's related to the new subnet.

Cheers,

Barbara

On 10/27/20 9:58 PM, Gard Nelson wrote:
Thanks for your help, Prentice.

Sorry, yes – centos 7.5 installed on a fresh HDD. I rebooted and checked that 
chronyd is disabled. ntpd is running. The rest of the cluster uses centos 7.5 
and ntp so it’s possible, although maybe not ideal.

I’m running ntpq on the new compute node. It is looking to the slurm head node 
which is also set up as the ntp server. Here’s the output:

[root ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000

It was a bit of a pain to get set up. The time difference was several hours, so 
ntp would have taken ages to fix it on its own. I have used ntpdate 
successfully on the existing compute nodes, but got a “no server suitable for 
synchronization found” error here, and ‘ntpd -gqx’ timed out. So, to set the 
time, I had to point ntp at the default centos pool of ntp servers and then 
point it back to the headnode. After that, ‘ntpd -gqx’ ran smoothly and I 
assume (based on the ntpq output) that it worked. Running ‘date’ on the new 
compute node and the existing head node simultaneously returns the same time to 
within ~1 sec, rather than the roughly 7:30 gap from the log file.

Not sure if it’s relevant to this problem, but the new compute node is on a 
different subnet connected to a different port than the existing compute nodes. 
This is the first time that I’ve set up a node on a different subnet. I figured 
it would be simple to point slurm to the new node, but I didn’t anticipate ntp 
and munge issues.

Thanks,
Gard

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Prentice Bisbal <pbis...@pppl.gov>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Tuesday, October 27, 2020 at 12:22 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [External] Munge thinks clocks aren't synced

You don't specify what OS or version you're using. If you're using RHEL 7 or a 
derivative, chrony is used by default over ntpd, so there could be some 
confusion between chronyd and ntpd. If you haven't done so already, I'd check 
to see which daemon is actually running on your system.
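
One possible way to script that check, assuming a systemd-based system such as RHEL/CentOS 7 where `systemctl is-active` is available. The function name `active_time_daemons` and the `run` parameter are hypothetical, added only so the sketch can be exercised without touching a real system.

```python
import subprocess


def active_time_daemons(units=("chronyd", "ntpd"), run=subprocess.run):
    """Return the listed systemd units that report themselves as active."""
    active = []
    for unit in units:
        # `systemctl is-active UNIT` prints the unit state, e.g. "active"
        # or "inactive", on stdout.
        result = run(["systemctl", "is-active", unit],
                     capture_output=True, text=True)
        if result.stdout.strip() == "active":
            active.append(unit)
    return active
```

On a node of the cluster, `print(active_time_daemons())` should list exactly one daemon. If both names come back, chronyd and ntpd are competing for the clock.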

Can you share the complete output of ntpq -p with us, and let us know what 
nodes the output is from? You might want to run 'ntpdate' before starting ntpd. 
If the clocks are too far off, either ntpd won't correct the time, or it will 
take a long time. ntpdate immediately syncs up the time between servers.

I would make sure ntpdate is installed and enabled, then reboot both compute 
nodes. This ensures that ntpdate is called at startup before ntpd, so that 
everything starts out with the correct time.

--
Prentice

On 10/27/20 2:08 PM, Gard Nelson wrote:
Hi everyone,

I’m adding a new node to an existing cluster. After installing slurm and the 
prereqs, I synced the clocks with ntpd. When I run ‘ntpq -p’, I get 0.0 for 
delay, offset and jitter (the slurm head node is also the ntp server). ‘date’ 
also gives me identical times for the head and compute nodes. However, when I 
start slurmd, I get a munge error about the clocks being out of sync. From the 
slurmctld log:

[2020-10-27T11:02:06.511] node NEW_NODE returned to service
[2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential
[2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
[2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
[2020-10-27T11:02:07.265] error: Check for out of sync clocks
[2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound credential
[2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: Protocol authentication error
[2020-10-27T11:02:07.275] error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified error

I restarted ntp, munge and the slurm daemons on both nodes before this last 
error was generated. Any idea what’s going on here?
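
The ENCODED and DECODED lines in the log above already show how far apart the two clocks were when the credential was checked. A small sketch for pulling that number out of the log follows. The function name `credential_skew_seconds` is hypothetical, and the timestamp format is assumed to match the lines shown here.

```python
import re
from datetime import datetime

STAMP = re.compile(r"(ENCODED|DECODED): (.+)$")


def credential_skew_seconds(log_lines):
    """Return ENCODED minus DECODED, in seconds, from slurmctld log lines."""
    times = {}
    for line in log_lines:
        match = STAMP.search(line)
        if match:
            # Format as printed in the log, e.g. "Tue Oct 27 11:09:45 2020".
            # Day and month names are parsed assuming an English/C locale.
            times[match.group(1)] = datetime.strptime(
                match.group(2).strip(), "%a %b %d %H:%M:%S %Y")
    return (times["ENCODED"] - times["DECODED"]).total_seconds()


log = [
    "[2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential",
    "[2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020",
    "[2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020",
]
print(credential_skew_seconds(log))  # 458.0
```

A positive result means the encoding host's clock was ahead of the decoding host's clock; here that is 458 seconds, or 7 minutes 38 seconds.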

Thanks,
Gard

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
