Shot down in 🔥🔥

On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <[email protected]> wrote:

> Yes, all nodes are synchronized with chrony.
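>
> (For reference, a quick way to verify that on each node, using chrony's
> own client -- exact output format varies by chrony version:
>
>     chronyc tracking | grep 'System time'   # current offset from NTP time
>     chronyc sources -v                      # a reachable, selected source
> )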
>
> On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected])
> wrote:
>
>> Err, are all your nodes on the same time?
>>
>> Actually, slurmd will not start if a compute node's clock is too far
>> off from the controller node's, so you should be OK.
>>
>> I would still check that the times on all nodes are in agreement.
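>>
>> Something like this is a quick sanity check if you have pdsh set up
>> (the node range is just an example):
>>
>>     pdsh -w 'node[01-04]' date +%s | sort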
>>
>> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users <
>> [email protected]> wrote:
>>
>>> Hi, sorry for the late reply.
>>>
>>> We tested your proposal and can confirm that all nodes have each other
>>> in their respective /etc/hosts files. We can also confirm that the
>>> slurmd port is not blocked.
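>>>
>>> Concretely, we checked along these lines (6818 is the default
>>> SlurmdPort; adjust if slurm.conf overrides it):
>>>
>>>     nc -zv <node x> 6818                    # reachable from the controller
>>>     ssh <node x> 'ss -tlnp | grep slurmd'   # slurmd actually listening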
>>>
>>> One thing we found useful for reproducing the issue: if we run srun
>>> -w <node x> and then, in another session, run srun -w <node x> again,
>>> the second srun waits for resources while the first one gets into
>>> <node x>. If we exit the session in the first shell, the srun that was
>>> waiting gets "error: security violation/invalid job credentials"
>>> instead of getting into <node x>.
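>>>
>>> Roughly, the sequence is (a sketch; --pty bash is just one way to get
>>> an interactive shell):
>>>
>>>     [shell 1]$ srun -w <node x> --pty bash   # lands on <node x>
>>>     [shell 2]$ srun -w <node x> --pty bash   # waits for resources
>>>     [shell 1]$ exit
>>>     [shell 2]: srun: error: security violation/invalid job credentials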
>>>
>>> We also found that scontrol ping fails not only on the login node but
>>> also on the nodes of one specific partition, printing the longer
>>> message:
>>>
>>> Slurmctld(primary) at <headnode> is DOWN
>>> *****************************************
>>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>>> *****************************************
>>> Still, Slurm is able to assign those nodes to jobs.
>>>
>>> We also raised the debug level to the maximum on slurmctld, and when
>>> running scontrol ping we get these log entries:
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31
>>> 21:00:00 1969
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31
>>> 21:00:00 1969
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg:
>>> [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error:
>>> Unspecified error
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg:
>>> [[snmgt01]:55274] Protocol authentication error
>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]:
>>> Protocol authentication error
>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized
>>> credential for client UID=202 GID=202
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31
>>> 21:00:00 1969
>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31
>>> 21:00:00 1969
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg:
>>> [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error:
>>> Unspecified error
>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg:
>>> [[snmgt01]:55286] Protocol authentication error
>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]:
>>> Protocol authentication error
>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized
>>> credential for client UID=202 GID=202
>>>
>>> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
>>> 1969: that is the Unix epoch (1970-01-01 00:00 UTC) rendered in UTC-3,
>>> i.e. a zero timestamp on the credential. I checked that munge.key has
>>> the correct ownership and that all nodes have the same file.
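>>>
>>> For reference, the checks were along these lines (the TZ line just
>>> demonstrates the epoch rendering; <node x> is any compute node):
>>>
>>>     TZ=Etc/GMT+3 date -d @0            # prints Wed Dec 31 21:00:00 1969
>>>     cksum /etc/munge/munge.key         # same checksum on every node
>>>     ls -l /etc/munge/munge.key         # owned by the munge user
>>>     munge -n | unmunge                 # local encode/decode round trip
>>>     munge -n | ssh <node x> unmunge    # cross-node decode test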
>>>
>>> Does anyone have more documentation on what scontrol ping does? We
>>> haven't found detailed information in the docs.
>>>
>>> Best regards,
>>> Bruno Bruzzo
>>> System Administrator - Clementina XXI
>>>
>>>
>>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
>>> ([email protected]) wrote:
>>>
>>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>>
>>>> > slurmctld runs on management node mmgt01.
>>>> > srun and salloc fail intermittently on the login node; that is, we
>>>> > can successfully use srun on the login node from time to time, but
>>>> > it stops working for a while without us changing any configuration.
>>>>
>>>> This, to me, sounds like there could be a problem on the compute
>>>> nodes, or in the communication between logins and computes.  One
>>>> thing that has bitten me several times over the years is compute
>>>> nodes missing from /etc/hosts on other compute nodes.  Slurmctld
>>>> often sends messages to computes via other computes, and if a message
>>>> happens to go via a node that does not have the target compute in its
>>>> /etc/hosts, it cannot forward the message.
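>>>>
>>>> A brute-force way to check this (just a sketch, assuming passwordless
>>>> ssh from the controller):
>>>>
>>>>     nodes=$(sinfo -hNo '%N' | sort -u)
>>>>     for src in $nodes; do for dst in $nodes; do
>>>>       ssh "$src" getent hosts "$dst" > /dev/null ||
>>>>         echo "$src cannot resolve $dst"
>>>>     done; done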
>>>>
>>>> Another thing to look out for is whether any nodes running slurmd
>>>> (computes or logins) have their slurmd port blocked by firewalld or
>>>> something else.
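>>>>
>>>> E.g., on each node running slurmd (6818 is the default SlurmdPort;
>>>> the firewall-cmd line assumes firewalld):
>>>>
>>>>     ss -tln | grep 6818          # is slurmd listening?
>>>>     firewall-cmd --list-all      # is anything filtering that port?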
>>>>
>>>> > scontrol ping always shows DOWN from the login node, even when we
>>>> > can successfully run srun or salloc.
>>>>
>>>> This might indicate that the slurmctld port on mmgt01 is blocked, or the
>>>> slurmd port on the logins.
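>>>>
>>>> E.g., from a login node (6817 is the default SlurmctldPort; check
>>>> slurm.conf if it has been changed):
>>>>
>>>>     nc -zv mmgt01 6817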
>>>>
>>>> It might be something completely different, but I'd at least check
>>>> /etc/hosts on all nodes (controller, logins, computes) and check that
>>>> all needed ports are unblocked.
>>>>
>>>> --
>>>> Regards,
>>>> Bjørn-Helge Mevik, dr. scient,
>>>> Department for Research Computing, University of Oslo
>>>>
>>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
