Update:
We have solved the issue.

Our problem was that, even though we use a configless configuration, our
provisioning served an unconfigured slurm.conf file to /etc/slurm.

On the failing nodes, we could see:

scontrol show config | grep -i "hash_val"

cn080: HASH_VAL                = Different Ours=<...> Slurmctld=<...>


While on working nodes we saw:

scontrol show config | grep -i "hash_val"

cn044: HASH_VAL                = Match


Note: The failing nodes could still get jobs scheduled via sbatch. The
issue was with srun/salloc.

We removed the slurm.conf file, restarted services, and for now, everything
works fine.
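
In case this helps anyone else: with configless, the nodes should not carry
a local /etc/slurm/slurm.conf, since a local file can shadow the config
served by slurmctld. A minimal sketch of the check and of the fix we
applied (assuming the usual paths) is:

    scontrol show config | grep -i hash_val   # want "Match", not "Different"
    ls -l /etc/slurm/slurm.conf               # should not exist in configless mode

    rm /etc/slurm/slurm.conf                  # if a stale copy is there
    systemctl restart slurmd

plus making sure the provisioning tooling stops pushing the file.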

Thanks for the support.

Bruno Bruzzo
System Administrator - Clementina XXI

On Wed, Sep 24, 2025 at 3:51 PM, John Hearns ([email protected])
wrote:

> Shot down in 🔥🔥
>
> On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <[email protected]> wrote:
>
>> Yes, all nodes are synchronized with chrony.
>>
>> On Wed, Sep 24, 2025 at 3:28 PM, John Hearns ([email protected])
>> wrote:
>>
>>> Err., are all your nodes on the same time?
>>>
>>> Actually slurmd will not start if a compute node is too far away in time
>>> from the controller node. So you should be OK
>>>
>>> I would still check that the times on all nodes are in agreement
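>>>
>>> (A quick sketch for that, assuming pdsh is available and an example
>>> hostlist such as cn[001-100] for your nodes:
>>>
>>>     pdsh -w 'cn[001-100]' date +%s
>>>
>>> The epoch seconds should agree within a second or two across the
>>> cluster.)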
>>>
>>> On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users <
>>> [email protected]> wrote:
>>>
>>>> Hi, sorry for the late reply.
>>>>
>>>> We tested your proposal and can confirm that all nodes have each other
>>>> in their respective /etc/hosts files. We can also confirm that the slurmd
>>>> port is not blocked.
>>>>
>>>> One thing we found useful for reproducing the issue: if we run srun
>>>> -w <node x> and, in another session, srun -w <node x> again, the second
>>>> srun waits for resources while the first one gets onto <node x>. If we
>>>> then exit the first session, the srun that was waiting fails with
>>>> "error: security violation/invalid job credentials" instead of getting
>>>> onto <node x>.
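>>>>
>>>> Roughly (with --pty bash here only as an example interactive command):
>>>>
>>>>     # shell 1: lands on <node x>
>>>>     srun -w <node x> --pty bash
>>>>     # shell 2: waits for resources while shell 1 holds the node
>>>>     srun -w <node x> --pty bash
>>>>     # exiting shell 1 makes shell 2 fail with
>>>>     # "error: security violation/invalid job credentials"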
>>>>
>>>> We also found that scontrol ping fails not only on the login node but
>>>> also on the nodes of one specific partition, showing the longer message:
>>>>
>>>> Slurmctld(primary) at <headnode> is DOWN
>>>> *****************************************
>>>> ** RESTORE SLURMCTLD DAEMON TO SERVICE **
>>>> *****************************************
>>>> Still, Slurm is able to assign those nodes to jobs.
>>>>
>>>> We also raised the debug level to the maximum on slurmctld, and when
>>>> running scontrol ping we get these log entries:
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
>>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
>>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
>>>> [2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
>>>> [2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
>>>> [2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
>>>>
>>>> I find it suspicious that the date munge shows is Wed Dec 31 21:00:00
>>>> 1969, i.e. a zero Unix timestamp rendered in our local timezone. I checked
>>>> that munge.key has the correct ownership and that all nodes have the same
>>>> file.
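>>>>
>>>> (For reference, a quick way to verify that is to run
>>>>
>>>>     md5sum /etc/munge/munge.key
>>>>     ls -l /etc/munge/munge.key
>>>>
>>>> on every node and compare: the checksum must be identical everywhere,
>>>> and the file should be owned by the munge user and not world-readable.)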
>>>>
>>>> Does anyone have more documentation on what scontrol ping does? We
>>>> haven't found detailed information in the docs.
>>>>
>>>> Best regards,
>>>> Bruno Bruzzo
>>>> System Administrator - Clementina XXI
>>>>
>>>>
>>>> On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users
>>>> ([email protected]) wrote:
>>>>
>>>>> Bruno Bruzzo via slurm-users <[email protected]> writes:
>>>>>
>>>>> > slurmctld runs on management node mmgt01.
>>>>> > srun and salloc fail intermittently on login node, that means
>>>>> > we can successfully use srun on login node from time to time, but it
>>>>> > stops working for a while without us changing any configuration.
>>>>>
>>>>> This, to me, sounds like there could be a problem on the compute nodes,
>>>>> or in the communication between logins and computes.  One thing that has
>>>>> bitten me several times over the years is compute nodes missing from
>>>>> /etc/hosts on other compute nodes.  Slurmctld often sends messages to
>>>>> computes via other computes, and if a message happens to go via a node
>>>>> that does not have the target compute in its /etc/hosts, it cannot
>>>>> forward the message.
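>>>>>
>>>>> (A quick spot check, using a node name such as cn080 as an example, is
>>>>>
>>>>>     getent hosts cn080
>>>>>
>>>>> run on every compute and login node; an empty result on any node means
>>>>> that node cannot resolve, and therefore cannot forward messages to, the
>>>>> target.)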
>>>>>
>>>>> Another thing to look out for is whether any nodes running slurmd
>>>>> (computes or logins) have their slurmd port blocked by firewalld or
>>>>> something else.
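>>>>>
>>>>> (For example, on a node running slurmd, something like
>>>>>
>>>>>     ss -tlnp | grep slurmd        # run as root to see process names
>>>>>     firewall-cmd --list-all       # if firewalld is in use
>>>>>
>>>>> shows whether slurmd is listening on its port (6818 by default) and
>>>>> what the firewall currently allows.)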
>>>>>
>>>>> > scontrol ping always shows DOWN from login node, even when we can
>>>>> > successfully
>>>>> > run srun or salloc.
>>>>>
>>>>> This might indicate that the slurmctld port on mmgt01 is blocked, or
>>>>> the
>>>>> slurmd port on the logins.
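>>>>>
>>>>> (Something like
>>>>>
>>>>>     nc -zv mmgt01 6817    # default SlurmctldPort
>>>>>
>>>>> from the login node, assuming a netcat that supports -z, would tell you
>>>>> whether the controller port is reachable at all; adjust the host and
>>>>> port to match your slurm.conf.)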
>>>>>
>>>>> It might be something completely different, but I'd at least check
>>>>> /etc/hosts
>>>>> on all nodes (controller, logins, computes) and check that all needed
>>>>> ports are unblocked.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bjørn-Helge Mevik, dr. scient,
>>>>> Department for Research Computing, University of Oslo
>>>>>
>>>>>
>>>>
>>>>
>>>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
