Hello,

I am seeing strange errors in slurmd.log on 4 different nodes. The
errors are similar on all of them, and I don't understand them:

[2021-09-24T18:27:41.822] slurmd started on Fri, 24 Sep 2021 18:27:41 +0000
[2021-09-24T18:27:41.822] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=772485 TmpDisk=93353 Uptime=15975960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-24T18:29:01.002] error: Munge decode failed: Invalid credential
[2021-09-24T18:29:01.002] ENCODED: Thu Jan 01 00:00:00 1970
[2021-09-24T18:29:01.002] DECODED: Thu Jan 01 00:00:00 1970
[2021-09-24T18:29:01.002] error: slurm_receive_msg_and_forward: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2021-09-24T18:29:01.002] error: slurm_receive_msg_and_forward: Protocol authentication error
[2021-09-24T18:29:01.012] error: service_connection: slurm_receive_msg: Protocol authentication error

These errors appear over and over again.

We have chrony installed on all nodes and the clocks are synchronized.

I can run `munge -n | unmunge` successfully, and I can also run
`munge -n` on one node and unmunge the result on another node.

After I resumed one of those nodes and ran a dummy job on it, the
errors disappeared.

What do these errors mean? Why is Slurm trying to encode/decode
credentials from 1970?
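
For what it's worth, the ENCODED/DECODED value in the log is exactly
Unix time 0, so my guess (an assumption, not something I have confirmed
in the Slurm or MUNGE source) is that these fields were never filled in
because the decode failed, rather than a node's clock actually reading
1970. A quick sketch of that check in Python:

```python
from datetime import datetime, timezone

# Unix time 0 in UTC, rendered in the same layout as the
# ENCODED/DECODED lines in slurmd.log.
epoch_zero = datetime.fromtimestamp(0, tz=timezone.utc)
formatted = epoch_zero.strftime("%a %b %d %H:%M:%S %Y")

print(formatted)
```

With the default locale this prints the same string that appears in
the log lines above.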

Thank you,
Heitor
