Yes, all nodes are synchronized with chrony.
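
A quick way to confirm that across the whole cluster, sketched here assuming pdsh is installed and "nodes" is a configured host group (neither is from the thread):

    # Each node's current offset from its NTP source; the offsets should
    # all agree to within a few milliseconds.
    pdsh -g nodes 'chronyc tracking | grep "System time"'

    # Or let chrony judge for itself: waitsync exits non-zero if the clock
    # is not synchronized within the given limits (one try, 0.1 s max).
    pdsh -g nodes 'chronyc waitsync 1 0.1 || echo NOT SYNCED'
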
On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected]) wrote:

> Err, are all your nodes on the same time?
>
> Actually, slurmd will not start if a compute node is too far away in time
> from the controller node. So you should be OK.
>
> I would still check that the times on all nodes are in agreement.
>
> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users
> ([email protected]) wrote:
>
>> Hi, sorry for the late reply.
>>
>> We tested your proposal and can confirm that all nodes have each other in
>> their respective /etc/hosts. We can also confirm that the slurmd port is
>> not blocked.
>>
>> One thing we found useful for reproducing the issue: if we run
>> srun -w <node x> in one session and srun -w <node x> in another, the
>> second srun waits for resources while the first one gets onto <node x>.
>> If we then exit the first session, the srun that was waiting gets
>> "error: security violation/invalid job credentials" instead of getting
>> onto <node x>.
>>
>> We also found that scontrol ping not only fails on the login node, but
>> also on the nodes of a specific partition, printing the full warning:
>>
>> Slurmctld(primary) at <headnode> is DOWN
>> *****************************************
>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>> *****************************************
>>
>> Still, Slurm is able to assign those nodes to jobs.
>>
>> We also raised the debug level to the maximum on slurmctld, and when we
>> run scontrol ping we get these log entries:
>>
>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>
>> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
>> 1969. I checked that munge.key has the correct ownership and that all
>> nodes have the same file.
>>
>> Does anyone have more documentation on what scontrol ping does? We
>> haven't found detailed information in the docs.
>>
>> Best regards,
>> Bruno Bruzzo
>> System Administrator - Clementina XXI
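
Worth noting on the timestamps above: Wed Dec 31 21:00:00 1969 is simply the Unix epoch rendered in UTC-3, which suggests the credential's time fields are decoding as zero when verification fails; it is not by itself evidence of a clock problem. A minimal way to exercise the same munge path outside of Slurm, sketched assuming working ssh between the nodes (<headnode> and the pdsh group "nodes" are placeholders, not from the thread):

    # Encode a credential on the failing node and decode it on the
    # controller; STATUS in the unmunge output should read "Success"
    # in both directions.
    munge -n | ssh <headnode> unmunge
    ssh <headnode> munge -n | unmunge

    # "Unauthorized credential for client UID=202 GID=202" also makes the
    # UID mapping worth checking: the same numeric UID should resolve to
    # the same account (e.g. the slurm user) on every node.
    pdsh -g nodes 'getent passwd 202'

If the cross-node unmunge fails while a local "munge -n | unmunge" succeeds, the keys differ or munged is rejecting the peer; if both succeed, the problem more likely sits in how slurmctld validates the credential.
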
>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
>> ([email protected]) wrote:
>>
>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>
>>> > slurmctld runs on management node mmgt01.
>>> > srun and salloc fail intermittently on the login node; that means
>>> > we can successfully use srun on the login node from time to time,
>>> > but it stops working for a while without us changing any
>>> > configuration.
>>>
>>> This, to me, sounds like there could be a problem on the compute nodes,
>>> or in the communication between logins and computes. One thing that has
>>> bitten me several times over the years is compute nodes missing from
>>> /etc/hosts on other compute nodes. Slurmctld often sends messages to
>>> computes via other computes, and if a message happens to go via a node
>>> that does not have the target compute in its /etc/hosts, that node
>>> cannot forward the message.
>>>
>>> Another thing to look out for is whether any node running slurmd
>>> (compute or login) has its slurmd port blocked by firewalld or
>>> something else.
>>>
>>> > scontrol ping always shows DOWN from the login node, even when we
>>> > can successfully run srun or salloc.
>>>
>>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>>> the slurmd port on the logins.
>>>
>>> It might be something completely different, but I'd at least check
>>> /etc/hosts on all nodes (controller, logins, computes) and check that
>>> all needed ports are unblocked.
>>>
>>> --
>>> Regards,
>>> Bjørn-Helge Mevik, dr. scient,
>>> Department for Research Computing, University of Oslo
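
On Bruno's question: per the scontrol man page, ping simply contacts each configured SlurmctldHost and reports whether the daemon responds, so the DOWN output above is the same RPC path that fails authentication in the slurmctld log. A minimal sketch of the checks Bjørn-Helge suggests (the port numbers are read from the live config; the Slurm defaults are 6817 for slurmctld and 6818 for slurmd, and nc plus the pdsh group "nodes" are assumptions):

    # Which ports this cluster actually uses:
    scontrol show config | grep -Ei 'slurm(ctld|d)port'

    # From a login or compute node: is slurmctld reachable on the controller?
    nc -zv <headnode> 6817

    # From the controller: is slurmd reachable on a given compute or login node?
    nc -zv <node x> 6818

    # Every node should be able to resolve every other node:
    pdsh -g nodes 'getent hosts <node x> || echo "missing on $(hostname)"'
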
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
