Update: We have solved the issue. Our problem was that even though we have a configless configuration, our provisioning served an unconfigured slurm.conf file to /etc/slurm.
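In case it helps anyone else, below is a rough sketch of the check and cleanup we ended up doing. It assumes pdsh with a hypothetical "compute" host group; substitute your own fan-out tool and node list.

    # Compare the config hash each node is using against slurmctld's
    pdsh -g compute 'scontrol show config | grep -i hash_val'

    # On nodes reporting "Different", look for a stray local slurm.conf that
    # shadows the configless configuration, move it out of the way, and
    # restart slurmd
    pdsh -g compute 'ls -l /etc/slurm/slurm.conf 2>/dev/null'
    pdsh -g compute 'test -f /etc/slurm/slurm.conf && mv /etc/slurm/slurm.conf /root/slurm.conf.bak && systemctl restart slurmd'

Also make sure the provisioning system stops laying that file down, or it will come back on the next reimage.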
On the failing nodes we could see:

    scontrol show config | grep -i "hash_val"
    cn080: HASH_VAL = Different Ours=<...> Slurmctld=<...>

while on working nodes we saw:

    scontrol show config | grep -i "hash_val"
    cn044: HASH_VAL = Match

Note: the failing nodes could still get jobs scheduled via sbatch; the issue was only with srun/salloc. We removed the slurm.conf file, restarted the services, and for now everything works fine.

Thanks for the support.

Bruno Bruzzo
System Administrator - Clementina XXI

On Wed, Sep 24, 2025 at 3:51 PM, John Hearns ([email protected]) wrote:

> Shot down in 🔥🔥
>
> On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <[email protected]> wrote:
>
>> Yes, all nodes are synchronized with chrony.
>>
>> On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected]) wrote:
>>
>>> Err, are all your nodes on the same time?
>>>
>>> Actually slurmd will not start if a compute node is too far away in time
>>> from the controller node, so you should be OK. I would still check that
>>> the times on all nodes are in agreement.
>>>
>>> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users
>>> <[email protected]> wrote:
>>>
>>>> Hi, sorry for the late reply.
>>>>
>>>> We tested your proposal and can confirm that all nodes have each other
>>>> in their respective /etc/hosts. We can also confirm that the slurmd port
>>>> is not blocked.
>>>>
>>>> One way we found to reproduce the issue: if we run srun -w <node x> and,
>>>> in another session, run srun -w <node x> again, the second srun waits
>>>> for resources while the first one gets into <node x>. If we then exit
>>>> the first session, the srun that was waiting gets "error: security
>>>> violation/invalid job credentials" instead of getting into <node x>.
>>>>
>>>> We also found that scontrol ping fails not only on the login node but
>>>> also on the nodes of a specific partition, showing the larger message:
>>>>
>>>> Slurmctld(primary) at <headnode> is DOWN
>>>> *****************************************
>>>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>>>> *****************************************
>>>>
>>>> Still, Slurm is able to assign those nodes for jobs.
>>>>
>>>> We also raised the debug level to the max on slurmctld, and when doing
>>>> the scontrol ping we get this log:
>>>>
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
>>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
>>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
>>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
>>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>>>
>>>> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
>>>> 1969. I checked that munge.key has the correct ownership and that all
>>>> nodes have the same file.
>>>>
>>>> Does anyone have more documentation on what scontrol ping does? We
>>>> haven't found detailed information in the docs.
>>>>
>>>> Best regards,
>>>> Bruno Bruzzo
>>>> System Administrator - Clementina XXI
>>>>
>>>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
>>>> ([email protected]) wrote:
>>>>
>>>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>>>
>>>>> > slurmctld runs on management node mmgt01.
>>>>> > srun and salloc fail intermittently on the login node, meaning
>>>>> > we can successfully use srun on the login node from time to time,
>>>>> > but it stops working for a while without us changing any configuration.
>>>>>
>>>>> This, to me, sounds like there could be a problem on the compute nodes,
>>>>> or in the communication between logins and computes. One thing that has
>>>>> bitten me several times over the years is compute nodes missing from
>>>>> /etc/hosts on other compute nodes. Slurmctld often sends messages to
>>>>> computes via other computes, and if a message happens to go via a node
>>>>> that does not have the target compute in its /etc/hosts, it cannot
>>>>> forward the message.
>>>>>
>>>>> Another thing to look out for is whether any nodes running slurmd
>>>>> (computes or logins) have their slurmd port blocked by firewalld or
>>>>> something else.
>>>>>
>>>>> > scontrol ping always shows DOWN from the login node, even when we
>>>>> > can successfully run srun or salloc.
>>>>>
>>>>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>>>>> the slurmd port on the logins.
>>>>>
>>>>> It might be something completely different, but I'd at least check
>>>>> /etc/hosts on all nodes (controller, logins, computes) and check that
>>>>> all needed ports are unblocked.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bjørn-Helge Mevik, dr. scient,
>>>>> Department for Research Computing, University of Oslo
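A side note for the archives on the munge errors quoted above: the ENCODED/DECODED timestamps of Wed Dec 31 21:00:00 1969 are just the Unix epoch rendered in UTC-3, which suggests the credential timestamps were never filled in because the decode itself failed. A quick end-to-end munge sanity check between two hosts (using cn080 here purely as an example name, assuming root and SSH access) is:

    # Confirm every node has the same key (compare checksums)
    md5sum /etc/munge/munge.key
    ssh cn080 md5sum /etc/munge/munge.key

    # Encode a credential on one host and decode it on the other, in both directions
    munge -n | ssh cn080 unmunge
    ssh cn080 munge -n | unmunge

In our case the key was fine and the real culprit was the stale slurm.conf described at the top, but this check rules munge out quickly.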
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
