Shot down in 🔥🔥

On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <[email protected]> wrote:

> Yes, all nodes are synchronized with chrony.
>
> On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected]) wrote:
>
>> Err, are all your nodes on the same time?
>>
>> Actually, slurmd will not start if a compute node's clock is too far away
>> from the controller node's, so you should be OK.
>>
>> I would still check that the times on all nodes are in agreement.
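>>
>> A quick way to eyeball that (just a sketch with illustrative node names,
>> assuming passwordless ssh from the controller) is to compare epoch times:
>>
>>   for n in mmgt01 login01 node01 node02; do
>>       # print each node's name next to its current epoch time
>>       printf '%s: ' "$n"; ssh "$n" date +%s
>>   done
>>
>> If the nodes use chrony, "chronyc tracking" on each one also reports the
>> measured offset from its time source.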
>> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users
>> <[email protected]> wrote:
>>
>>> Hi, sorry for the late reply.
>>>
>>> We tested your proposal and can confirm that all nodes have each other
>>> in their respective /etc/hosts. We can also confirm that the slurmd port
>>> is not blocked.
>>>
>>> One thing we found useful for reproducing the issue: if we run
>>> srun -w <node x> and, in another session, run srun -w <node x> again,
>>> the second srun waits for resources while the first one gets into
>>> <node x>. If we then exit the first session, the srun that was waiting
>>> gets "error: security violation/invalid job credentials" instead of
>>> getting into <node x>.
>>>
>>> We also found that scontrol ping fails not only on the login node but
>>> also on the nodes of a specific partition, showing the longer message:
>>>
>>> Slurmctld(primary) at <headnode> is DOWN
>>> *****************************************
>>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>>> *****************************************
>>>
>>> Still, Slurm is able to assign those nodes to jobs.
>>>
>>> We also raised the debug level to the maximum on slurmctld, and when we
>>> run scontrol ping we get this log:
>>>
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>>
>>> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
>>> 1969. I checked that munge.key has the correct ownership and that all
>>> nodes have the same file.
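>>>
>>> For reference, one check we can still run (sketched here with the
>>> placeholder host names from above, and assuming ssh access as a user
>>> allowed to read the key) is to encode a credential on one host and
>>> decode it on another, which should expose both key mismatches and
>>> clock skew:
>>>
>>>   # encode and decode locally as a sanity check
>>>   munge -n | unmunge
>>>
>>>   # encode on the login node, decode on the controller and on a compute
>>>   munge -n | ssh <headnode> unmunge
>>>   munge -n | ssh <node x> unmunge
>>>
>>>   # compare key fingerprints across hosts
>>>   md5sum /etc/munge/munge.key
>>>
>>> unmunge prints the ENCODE_TIME and DECODE_TIME it sees, so a bogus 1969
>>> timestamp should show up directly if the credential itself is at fault.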
>>>
>>> Does anyone have more documentation on what scontrol ping does? We
>>> haven't found detailed information in the docs.
>>>
>>> Best regards,
>>> Bruno Bruzzo
>>> System Administrator - Clementina XXI
>>>
>>>
>>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
>>> ([email protected]) wrote:
>>>
>>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>>
>>>> > slurmctld runs on management node mmgt01.
>>>> > srun and salloc fail intermittently on the login node; that is, we
>>>> > can successfully use srun on the login node from time to time, but it
>>>> > stops working for a while without us changing any configuration.
>>>>
>>>> This, to me, sounds like there could be a problem on the compute nodes,
>>>> or in the communication between logins and computes. One thing that has
>>>> bitten me several times over the years is compute nodes missing from
>>>> /etc/hosts on other compute nodes. Slurmctld often sends messages to
>>>> computes via other computes, and if a message happens to go via a node
>>>> that does not have the target compute in its /etc/hosts, it cannot
>>>> forward the message.
>>>>
>>>> Another thing to look out for is whether any nodes running slurmd
>>>> (computes or logins) have their slurmd port blocked by firewalld or
>>>> something else.
>>>>
>>>> > scontrol ping always shows DOWN from the login node, even when we can
>>>> > successfully run srun or salloc.
>>>>
>>>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>>>> the slurmd port on the logins.
>>>>
>>>> It might be something completely different, but I'd at least check
>>>> /etc/hosts on all nodes (controller, logins, computes) and check that
>>>> all needed ports are unblocked.
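>>>>
>>>> A rough way to spot-check both from a login node (illustrative host
>>>> names; 6817 and 6818 are the Slurm defaults, so adjust to whatever
>>>> SlurmctldPort and SlurmdPort are in your slurm.conf, and this assumes
>>>> nc is installed):
>>>>
>>>>   # does every host resolve consistently on this node?
>>>>   getent hosts mmgt01 login01 node01
>>>>
>>>>   # can we reach slurmctld on the controller?
>>>>   nc -zv mmgt01 6817
>>>>
>>>>   # can we reach slurmd on a compute node?
>>>>   nc -zv node01 6818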
>>>>
>>>> --
>>>> Regards,
>>>> Bjørn-Helge Mevik, dr. scient,
>>>> Department for Research Computing, University of Oslo

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
