Yes, all nodes are synchronized with chrony.
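
For reference, this is roughly how we check it on each node (a sketch; chronyd
is assumed to be the time service, and the node names below are just examples):

  # on any node: is chrony synchronized, and what is the offset?
  chronyc tracking | grep -E 'Reference ID|System time|Leap status'

  # quick wall-clock comparison across nodes (example node names)
  for n in mmgt01 login01 node01; do
      printf '%s: ' "$n"; ssh "$n" date -u +%s
  done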

On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected])
wrote:

> Err., are all your nodes on the same time?
>
> Actually, slurmd will not start if a compute node is too far away in time
> from the controller node, so you should be OK.
>
> I would still check that the times on all nodes are in agreement.
>
> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users <
> [email protected]> wrote:
>
>> Hi, sorry for the late reply.
>>
>> We tested your proposal and can confirm that all nodes have each other in
>> their respective /etc/hosts. We can also confirm that the slurmd port is not
>> blocked.
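>>
>> A rough sketch of how we checked this (6818 is only the default SlurmdPort
>> and may be overridden in slurm.conf; nc/ncat is assumed to be installed):
>>
>>   # name resolution of <node x>, run from every other node
>>   getent hosts <node x>
>>
>>   # slurmd port reachability from the controller and the login node
>>   nc -zv <node x> 6818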
>>
>> One thing we found useful for reproducing the issue: if we run srun
>> -w <node x> and, in another session, srun -w <node x> again, the second srun
>> waits for resources while the first one gets into <node x>. If we exit the
>> session in the first shell, the one that was waiting gets "error: security
>> violation/invalid job credentials" instead of getting into <node x>.
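>>
>> In concrete terms (a sketch; --pty bash is just how we get an interactive
>> shell, and <node x> is any node of the affected partition):
>>
>>   # shell 1: gets the node
>>   srun -w <node x> --pty bash
>>
>>   # shell 2: pends on the same node, then fails with the credential error
>>   # as soon as shell 1 exits
>>   srun -w <node x> --pty bash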
>>
>> We also found that scontrol ping not only fails on the login node, but
>> also on the nodes of a specific partition, showing the larger message:
>>
>> Slurmctld(primary) at <headnode> is DOWN
>> *****************************************
>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>> *****************************************
>> Still, Slurm is able to assign jobs to those nodes.
>>
>> We also raised the debug level to the maximum on slurmctld, and when we run
>> scontrol ping we get this log:
>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
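>>
>> (For reference, a sketch of how the debug level can be raised and lowered on
>> the fly, run with SlurmUser/root privileges:)
>>
>>   scontrol setdebug debug5          # maximum slurmctld verbosity
>>   scontrol setdebugflags +Protocol  # optional extra RPC detail
>>   scontrol ping                     # reproduce
>>   scontrol setdebug info            # back to the usual level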
>>
>> I find it suspicious that the timestamp munge prints is Wed Dec 31 21:00:00
>> 1969, which is the Unix epoch rendered in UTC-3, i.e. a zero timestamp. I
>> checked that munge.key has the correct ownership and that all nodes have the
>> same file.
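>>
>> For completeness, a sketch of the checks we ran (assuming the default key
>> location /etc/munge/munge.key; adjust if your installation differs):
>>
>>   # same key everywhere, owned by the munge user, not world-readable
>>   md5sum /etc/munge/munge.key
>>   ls -l /etc/munge/munge.key
>>
>>   # local encode/decode round trip
>>   munge -n | unmunge
>>
>>   # cross-node round trip (login node -> controller and back)
>>   munge -n | ssh mmgt01 unmunge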
>>
>> Does anyone have more documentation on what scontrol ping does? We haven't
>> found detailed information in the docs.
>>
>> Best regards,
>> Bruno Bruzzo
>> System Administrator - Clementina XXI
>>
>>
>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users (
>> [email protected]) wrote:
>>
>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>
>>> > slurmctld runs on management node mmgt01.
>>> > srun and salloc fail intermittently on the login node; that means
>>> > we can successfully use srun on the login node from time to time, but it
>>> > stops working for a while without us changing any configuration.
>>>
>>> This, to me, sounds like there could be a problem on the compute nodes,
>>> or with the communication between logins and computes.  One thing that has
>>> bitten me several times over the years is compute nodes missing from
>>> /etc/hosts on other compute nodes.  Slurmctld often sends messages
>>> to computes via other computes, and if a message happens to go via a
>>> node that does not have the target compute in its /etc/hosts, it cannot
>>> forward the message.
>>>
>>> Another thing to look out for is whether any nodes running slurmd
>>> (computes or logins) have their slurmd port blocked by firewalld or
>>> something else.
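>>>
>>> Something along these lines (a sketch; 6818 is only the default SlurmdPort):
>>>
>>>   ssh <compute> 'ss -tlnp | grep slurmd'   # is slurmd listening?
>>>   ssh <compute> 'firewall-cmd --list-all'  # if firewalld is in use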
>>>
>>> > scontrol ping always shows DOWN from the login node, even when we can
>>> > successfully run srun or salloc.
>>>
>>> This might indicate that the slurmctld port on mmgt01 is blocked, or the
>>> slurmd port on the logins.
>>>
>>> It might be something completely different, but I'd at least check
>>> /etc/hosts
>>> on all nodes (controller, logins, computes) and check that all needed
>>> ports are unblocked.
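>>>
>>> A quick way to see and test the ports actually in use (a sketch, run from
>>> the login node; 6817 is only the default SlurmctldPort):
>>>
>>>   scontrol show config | grep -Ei 'slurmctldport|slurmdport'
>>>   nc -zv mmgt01 6817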
>>>
>>> --
>>> Regards,
>>> Bjørn-Helge Mevik, dr. scient,
>>> Department for Research Computing, University of Oslo
>>>
>>
>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
