Err, are all your nodes on the same time? Actually, slurmd will not start if a compute node's clock is too far off from the controller node's, so you should be OK there.
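For a quick look at clock skew across the cluster, something like the following run from the controller can help (just a sketch, assuming pdsh is installed; the node names below are made up):

    # print each node's clock; the second field is seconds since the epoch
    pdsh -w login01,cn[01-16] date +%s.%N | sort -n -k2

    # or ask chrony on each node how far off it thinks it is
    # (assumes chronyd; adjust if you run ntpd instead)
    pdsh -w login01,cn[01-16] 'chronyc tracking | grep "System time"'

If pdsh is not available, a simple for loop over ssh does the same job.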
I would still check that the times on all nodes are in agreement, and given
the munge errors in your log I would also try the quick munge and port
checks sketched below the quoted thread.

On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users
<[email protected]> wrote:

> Hi, sorry for the late reply.
>
> We tested your proposal and can confirm that all nodes have each other in
> their respective /etc/hosts. We can also confirm that the slurmd port is
> not blocked.
>
> One thing we found useful to reproduce the issue: if we run srun -w
> <node x> and, in another session, a second srun -w <node x>, the second
> srun waits for resources while the first one gets into <node x>. If we
> exit the session in the first shell, the one that was waiting gets
> "error: security violation/invalid job credentials" instead of getting
> into <node x>.
>
> We also found that scontrol ping fails not only on the login node but
> also on the nodes of a specific partition, showing the longer message:
>
> Slurmctld(primary) at <headnode> is DOWN
> *****************************************
> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
> *****************************************
>
> Still, Slurm is able to assign those nodes to jobs.
>
> We also raised the debug level to the maximum on slurmctld, and when
> running scontrol ping we get this log:
>
> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>
> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
> 1969. I checked that munge.key has the correct ownership and that all
> nodes have the same file.
>
> Does anyone have more documentation on what scontrol ping does? We
> haven't found detailed information in the docs.
>
> Best regards,
> Bruno Bruzzo
> System Administrator - Clementina XXI
>
>
> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
> ([email protected]) wrote:
>
>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>
>> > slurmctld runs on management node mmgt01.
>> > srun and salloc fail intermittently on the login node; that means
>> > we can successfully use srun on the login node from time to time,
>> > but it stops working for a while without us changing any
>> > configuration.
>>
>> This, to me, sounds like there could be a problem on the compute nodes,
>> or in the communication between logins and computes. One thing that has
>> bitten me several times over the years is compute nodes missing from
>> /etc/hosts on other compute nodes. Slurmctld often sends messages to
>> computes via other computes, and if a message happens to go via a node
>> that does not have the target compute in its /etc/hosts, it cannot
>> forward the message.
>>
>> Another thing to look out for is whether any nodes running slurmd
>> (computes or logins) have their slurmd port blocked by firewalld or
>> something else.
>>
>> > scontrol ping always shows DOWN from the login node, even when we can
>> > successfully run srun or salloc.
>>
>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>> the slurmd port on the logins.
>>
>> It might be something completely different, but I'd at least check
>> /etc/hosts on all nodes (controller, logins, computes) and check that
>> all needed ports are unblocked.
>>
>> --
>> Regards,
>> Bjørn-Helge Mevik, dr. scient,
>> Department for Research Computing, University of Oslo
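On the scontrol ping question in the quoted mail: as far as I understand it, scontrol ping is essentially just a REQUEST_PING RPC to slurmctld, authenticated through munge like any other RPC (that is the REQUEST_PING failing auth_g_verify in your log), so the DOWN status is at least consistent with the munge failures rather than a separate problem. A quick end-to-end munge test between two hosts is a round trip like the one below (only a sketch; mmgt01 is your controller, login01 is a placeholder for the login node, and it assumes the munge client tools are installed):

    # confirm the key really is byte-identical (run as root, since the key
    # is normally readable only by the munge user)
    ssh mmgt01 sha256sum /etc/munge/munge.key
    ssh login01 sha256sum /etc/munge/munge.key

    # encode a credential on this node and decode it on the controller;
    # STATUS: Success plus sane ENCODE/DECODE times means keys and clocks agree
    munge -n | ssh mmgt01 unmunge

    # and in the other direction
    ssh mmgt01 munge -n | unmunge

unmunge also prints the UID/GID the credential was encoded with, which you can compare against the UID=202 GID=202 in the error messages.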
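And although you already confirmed the slurmd port is open, it may be worth re-running the /etc/hosts and port checks from each node (login and computes), not only from the controller. Something along these lines (again only a sketch; cn01 is a placeholder compute name, and 6817/6818 are only the Slurm defaults, so check what your slurm.conf actually uses):

    # does this node resolve the controller and the target compute?
    getent hosts mmgt01 cn01

    # is the slurmctld port on the controller reachable from here?
    nc -zv mmgt01 6817

    # is the slurmd port on the compute reachable from here?
    nc -zv cn01 6818

    # the ports actually configured:
    scontrol show config | grep -E 'SlurmctldPort|SlurmdPort'

Any TCP reachability check works if nc is not installed.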
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
