Hi, sorry for the late reply.

We tested your proposal and can confirm that all nodes have each other in
their respective /etc/hosts files. We can also confirm that the slurmd port
is not blocked.
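
Roughly, the kind of check we mean (assuming the default SlurmdPort of
6818; adjust if slurm.conf sets a different port):

  getent hosts <node x>   # name resolution on every node (controller, logins, computes)
  nc -zv <node x> 6818    # slurmd port reachable from the controller and the login nodes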

One thing we found useful for reproducing the issue: if we run srun -w
<node x>, and in another session run srun -w <node x> again, the second
srun waits for resources while the first one gets a shell on <node x>. If
we then exit the first session, the srun that was waiting fails with
"error: security violation/invalid job credentials" instead of getting
into <node x>.
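
Step by step, the reproduction looks like this (<node x> is any node in the
affected partition; --pty bash is just how we get an interactive shell):

  # shell 1: gets a shell on <node x>
  srun -w <node x> --pty bash
  # shell 2, while shell 1 is still running: waits for resources
  srun -w <node x> --pty bash
  # exit shell 1; the waiting srun then fails with
  #   error: security violation/invalid job credentials
  # instead of starting on <node x>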

We also found that scontrol ping fails not only on the login node, but also
on the nodes of a specific partition, printing this longer message:

Slurmctld(primary) at <headnode> is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
Still, Slurm is able to schedule jobs on those nodes.
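
As a sanity check of the control path from one of those nodes, something
like this should show whether slurmctld is reachable at all (assuming the
default SlurmctldPort of 6817):

  scontrol show config | grep -E 'SlurmctldHost|SlurmctldPort'
  nc -zv <headnode> 6817   # plain TCP reachability to slurmctld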

We also raised the debug level to the maximum on slurmctld, and when
running scontrol ping we get this log:
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202

I find it suspicious that the date munge shows is Wed Dec 31 21:00:00 1969,
i.e. an unset, epoch-zero timestamp. I checked that munge.key has the
correct ownership and that all nodes have the same file.
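
The checks were roughly of this kind (<computeN> stands in for a node in
the affected partition; the cross-node test is the usual munge/unmunge
round trip):

  ls -l /etc/munge/munge.key           # owned by the munge user, mode 0400
  md5sum /etc/munge/munge.key          # same hash on every node
  munge -n | unmunge                   # local encode/decode
  munge -n | ssh <computeN> unmunge    # encode here, decode on the remote node
  ssh <computeN> munge -n | unmunge    # encode remotely, decode here
  date; ssh <computeN> date            # clocks in sync (credentials are time-limited)

unmunge prints ENCODE_TIME and DECODE_TIME, which should show the current
time rather than the 1969 value seen above.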

Does anyone have more documentation on what scontrol ping does? We haven't
found detailed information in the docs.

Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI


On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users (
[email protected]) wrote:

> Bruno Bruzzo via slurm-users <[email protected]> writes:
>
> > slurmctld runs on management node mmgt01.
> > srun and salloc fail intermittently on the login node; that means
> > we can successfully use srun on the login node from time to time, but it
> > stops working for a while without us changing any configuration.
>
> This, to me, sounds like there could be a problem on the compute nodes,
> or the communication between logins and computes.  One thing that has
> bitten me several times over the years is compute nodes missing from
> /etc/hosts on other compute nodes.  Slurmctld is often sending messages
> to computes via other computes, and if the messages happen to go via a
> node that does not have the target compute in its /etc/hosts, it cannot
> forward the message.
>
> Another thing to look out for is to check whether any nodes running
> slurmd (computes or logins) have their slurmd port blocked by firewalld
> or something else.
>
> > scontrol ping always shows DOWN from login node, even when we can
> > successfully
> > run srun or salloc.
>
> This might indicate that the slurmctld port on mmgt01 is blocked, or the
> slurmd port on the logins.
>
> It might be something completely different, but I'd at least check
> /etc/hosts
> on all nodes (controller, logins, computes) and check that all needed
> ports are unblocked.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>
>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
