Hi Olaf,
Since you are testing Slurm, perhaps my Slurm Wiki page may be of interest
to you:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
There is a discussion about the setup of Munge.
Best regards,
Ole
On 12/15/20 5:48 PM, Olaf Gellert wrote:
Hi all,
we are setting up a new test cluster to test some features for our
next HPC system. On one of the compute nodes we get these messages
in the log:
[2020-12-15T10:00:21.753] error: Munge decode failed: Invalid credential
[2020-12-15T10:00:21.753] auth/munge: _print_cred: ENCODED: Thu Jan 01 01:00:00 1970
[2020-12-15T10:00:21.753] auth/munge: _print_cred: DECODED: Thu Jan 01 01:00:00 1970
[2020-12-15T10:00:21.753] error: slurm_receive_msg_and_forward: g_slurm_auth_verify: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2020-12-15T10:00:21.753] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-12-15T10:00:21.763] error: service_connection: slurm_receive_msg: Protocol authentication error
I checked Munge authentication in the usual way:
- time between the nodes is synchronised
- munge uses the same UID/GID on both sides
- "munge -c0 -z0 -n | unmunge" works on the compute nodes and on the
  slurmctld node
- ssh slurmcontrolnode "munge -c0 -z0 -n" | unmunge works on a compute node
- ssh computenode "munge -c0 -z0 -n" | unmunge works on the slurmctld node
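The cross-node checks above could be scripted roughly like this. It is only a minimal sketch: it assumes passwordless ssh between the nodes and that munge and unmunge are on the PATH on both sides. The function name munge_crosscheck and the host names in the usage comment are placeholders, not anything Slurm or Munge provides.

```shell
#!/bin/sh
# Hypothetical helper: encode a credential on a remote host and decode it
# locally. Assumes passwordless ssh and munge/unmunge on PATH on both sides.
munge_crosscheck() {
    remote="$1"
    # Same munge options (-c0 -z0 -n) as in the manual checks above.
    # The pipeline's exit status is that of unmunge.
    if ssh "$remote" "munge -c0 -z0 -n" | unmunge >/dev/null 2>&1; then
        echo "OK: credential from $remote decodes locally"
    else
        echo "FAIL: credential from $remote does not decode locally"
        return 1
    fi
}

# Usage (host names are placeholders):
#   munge_crosscheck slurmcontrolnode   # run on a compute node
#   munge_crosscheck computenode        # run on the slurmctld node
```

Running it in both directions covers the same ground as the two ssh checks in the list.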
So Munge seems to work as far as I can tell. What else does
Slurm use Munge for? Are hostnames part of the authentication?
Should I be concerned about the time "Thu Jan 01 01:00:00 1970"
(in the logs above)?
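On the 1970 timestamp: that string is how Unix time 0 prints in a UTC+1 timezone. One possible reading (an assumption, not something the logs confirm) is that the encode and decode times were simply left at zero because the credential could not be decoded, rather than any clock being wrong. The sketch below reproduces the string. It assumes GNU date and uses the POSIX TZ string CET-1 as a stand-in for a UTC+1 node timezone; the function name epoch_zero_local is made up for illustration.

```shell
#!/bin/sh
# Print Unix time 0 in a UTC+1 zone, formatted like the slurmd log line.
# Assumes GNU date (-d @SECONDS). "CET-1" is a POSIX TZ string meaning UTC+1;
# the real timezone of the nodes is an assumption here.
epoch_zero_local() {
    LC_ALL=C TZ="CET-1" date -d @0 '+%a %b %d %H:%M:%S %Y'
}

epoch_zero_local
# prints: Thu Jan 01 01:00:00 1970
```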
All machines run CentOS 8, Slurm is a self-built 20.11.0, and
munge is from the CentOS 8 rpms:
munge-0.5.13-1.el8.x86_64
munge-libs-0.5.13-1.el8.x86_64
Cheers, Olaf