Hello, I am seeing strange errors in slurmd.log on 4 different nodes. The errors are similar on each node and I don't understand them:
[2021-09-24T18:27:41.822] slurmd started on Fri, 24 Sep 2021 18:27:41 +0000
[2021-09-24T18:27:41.822] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=772485 TmpDisk=93353 Uptime=15975960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-24T18:29:01.002] error: Munge decode failed: Invalid credential
[2021-09-24T18:29:01.002] ENCODED: Thu Jan 01 00:00:00 1970
[2021-09-24T18:29:01.002] DECODED: Thu Jan 01 00:00:00 1970
[2021-09-24T18:29:01.002] error: slurm_receive_msg_and_forward: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2021-09-24T18:29:01.002] error: slurm_receive_msg_and_forward: Protocol authentication error
[2021-09-24T18:29:01.012] error: service_connection: slurm_receive_msg: Protocol authentication error

These errors appear over and over again. We have chrony installed on all nodes and the clocks are synchronized. I can run `munge -n | unmunge` successfully, and I can also run `munge -n` on one node and unmunge the result on another node. After I resumed one of those nodes and ran a dummy job on it, the errors disappeared.

What do these errors mean? Why is Slurm trying to encode/decode credentials from 1970?

Thank you,
Heitor