Err, are all your nodes on the same time? Actually, slurmd will not start if a compute node's clock is too far off from the controller node's, so you should be OK there.
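For a quick look at clock skew across the cluster, something like the following run from the controller can help (just a sketch, assuming pdsh is installed; the node names below are made up):

    # print each node's clock; the second field is seconds since the epoch
    pdsh -w login01,cn[01-16] date +%s.%N | sort -n -k2

    # or ask chrony on each node how far off it thinks it is
    # (assumes chronyd; adjust if you run ntpd instead)
    pdsh -w login01,cn[01-16] 'chronyc tracking | grep "System time"'

If pdsh is not available, a simple for loop over ssh does the same job.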
I would still check that the times on all nodes are in agreement, and given
the munge errors in your log I would also try the quick munge and port
checks sketched below the quoted thread.

On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users
<[email protected]> wrote:

> Hi, sorry for the late reply.
>
> We tested your proposal and can confirm that all nodes have each other in
> their respective /etc/hosts. We can also confirm that the slurmd port is
> not blocked.
>
> One thing we found useful to reproduce the issue: if we run srun -w
> <node x> and, in another session, a second srun -w <node x>, the second
> srun waits for resources while the first one gets into <node x>. If we
> exit the session in the first shell, the one that was waiting gets
> "error: security violation/invalid job credentials" instead of getting
> into <node x>.
>
> We also found that scontrol ping fails not only on the login node but
> also on the nodes of a specific partition, showing the longer message:
>
> Slurmctld(primary) at <headnode> is DOWN
> *****************************************
> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
> *****************************************
>
> Still, Slurm is able to assign those nodes to jobs.
>
> We also raised the debug level to the maximum on slurmctld, and when
> running scontrol ping we get this log:
>
> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>
> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
> 1969. I checked that munge.key has the correct ownership and that all
> nodes have the same file.
>
> Does anyone have more documentation on what scontrol ping does? We
> haven't found detailed information in the docs.
>
> Best regards,
> Bruno Bruzzo
> System Administrator - Clementina XXI
>
>
> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
> ([email protected]) wrote:
>
>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>
>> > slurmctld runs on management node mmgt01.
>> > srun and salloc fail intermittently on the login node; that means
>> > we can successfully use srun on the login node from time to time,
>> > but it stops working for a while without us changing any
>> > configuration.
>>
>> This, to me, sounds like there could be a problem on the compute nodes,
>> or in the communication between logins and computes. One thing that has
>> bitten me several times over the years is compute nodes missing from
>> /etc/hosts on other compute nodes. Slurmctld often sends messages to
>> computes via other computes, and if a message happens to go via a node
>> that does not have the target compute in its /etc/hosts, it cannot
>> forward the message.
>>
>> Another thing to look out for is whether any nodes running slurmd
>> (computes or logins) have their slurmd port blocked by firewalld or
>> something else.
>>
>> > scontrol ping always shows DOWN from the login node, even when we can
>> > successfully run srun or salloc.
>>
>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>> the slurmd port on the logins.
>>
>> It might be something completely different, but I'd at least check
>> /etc/hosts on all nodes (controller, logins, computes) and check that
>> all needed ports are unblocked.
>>
>> --
>> Regards,
>> Bjørn-Helge Mevik, dr. scient,
>> Department for Research Computing, University of Oslo
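On the scontrol ping question in the quoted mail: as far as I understand it, scontrol ping is essentially just a REQUEST_PING RPC to slurmctld, authenticated through munge like any other RPC (that is the REQUEST_PING failing auth_g_verify in your log), so the DOWN status is at least consistent with the munge failures rather than a separate problem. A quick end-to-end munge test between two hosts is a round trip like the one below (only a sketch; mmgt01 is your controller, login01 is a placeholder for the login node, and it assumes the munge client tools are installed):

    # confirm the key really is byte-identical (run as root, since the key
    # is normally readable only by the munge user)
    ssh mmgt01 sha256sum /etc/munge/munge.key
    ssh login01 sha256sum /etc/munge/munge.key

    # encode a credential on this node and decode it on the controller;
    # STATUS: Success plus sane ENCODE/DECODE times means keys and clocks agree
    munge -n | ssh mmgt01 unmunge

    # and in the other direction
    ssh mmgt01 munge -n | unmunge

unmunge also prints the UID/GID the credential was encoded with, which you can compare against the UID=202 GID=202 in the error messages.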
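And although you already confirmed the slurmd port is open, it may be worth re-running the /etc/hosts and port checks from each node (login and computes), not only from the controller. Something along these lines (again only a sketch; cn01 is a placeholder compute name, and 6817/6818 are only the Slurm defaults, so check what your slurm.conf actually uses):

    # does this node resolve the controller and the target compute?
    getent hosts mmgt01 cn01

    # is the slurmctld port on the controller reachable from here?
    nc -zv mmgt01 6817

    # is the slurmd port on the compute reachable from here?
    nc -zv cn01 6818

    # the ports actually configured:
    scontrol show config | grep -E 'SlurmctldPort|SlurmdPort'

Any TCP reachability check works if nc is not installed.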
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
