Forgot to add: s2n-tls comes from EPEL and is version 1.5.10.


On 2025-09-25, 11:56 AM, "Grigory Shamov via slurm-users" 
<[email protected] <mailto:[email protected]>> wrote:


Hi All,


We have updated SLURM to the current 25.05.x release and tried to enable TLS on
it. The OS is AlmaLinux 8.10, with cgroups v1 and PMIx v4.


We see that srun fails for MPI jobs across nodes, with TLS-related errors,
when using PMIx (the default), but passes with srun --mpi=pmi2 or with mpirun.


TLSType = tls/s2n
TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs)


And the errors when using PMIx are


[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] 
socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed 
S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error 
encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed 
S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in 
/builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed 
S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> 
Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data: 
[unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to 
proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was 
unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv 
slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV


[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388 
[0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388 
[0]: pmixp_server.c:1586: Cannot send message to 
/var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist:
(null)
(and a couple more PMIx errors). It looks like the PMIx servers can no longer
talk to their peers?
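
As a side note, the bracketed numbers in the s2n errors above can be decoded
into an error category, which helps separate the root cause from the follow-on
failures. Below is a minimal Python sketch, assuming that s2n-tls keeps the
error type in the bits above the low 26 (S2N_ERR_NUM_VALUE_BITS) and that the
type numbering matches the s2n_error_type enum in s2n's public header; both are
assumptions recalled from the s2n API and should be checked against the
installed s2n-tls 1.5.10 headers. The helper name decode_s2n_errors is made up
for this illustration.

```python
import re

# Assumption: s2n-tls packs the error type into the bits above the low 26
# (S2N_ERR_NUM_VALUE_BITS in s2n's api/s2n.h). Verify against the installed headers.
S2N_ERR_NUM_VALUE_BITS = 26

# Assumption: numbering of the s2n_error_type enum.
S2N_ERROR_TYPES = {
    0: "S2N_ERR_T_OK",
    1: "S2N_ERR_T_IO",
    2: "S2N_ERR_T_CLOSED",
    3: "S2N_ERR_T_BLOCKED",
    4: "S2N_ERR_T_ALERT",
    5: "S2N_ERR_T_PROTO",
    6: "S2N_ERR_T_INTERNAL",
    7: "S2N_ERR_T_USAGE",
}

# Matches e.g. "S2N_ERR_CERT_UNTRUSTED[335544366]" as printed in the slurmd log.
S2N_CODE = re.compile(r"(S2N_ERR_[A-Z0-9_]+)\[(\d+)\]")


def decode_s2n_errors(log_text):
    """Return (name, code, error_type, index_within_type) for each s2n error found."""
    mask = (1 << S2N_ERR_NUM_VALUE_BITS) - 1
    results = []
    for name, raw in S2N_CODE.findall(log_text):
        code = int(raw)
        err_type = S2N_ERROR_TYPES.get(code >> S2N_ERR_NUM_VALUE_BITS, "unknown")
        results.append((name, code, err_type, code & mask))
    return results


# Log excerpts copied from the messages above (line wrapping does not matter).
sample = """
error: _negotiate: s2n_negotiate() failed
S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted
error: tls_p_recv: s2n_recv() failed
S2N_ERR_CLOSED[134217728]: connection is closed
error: tls_p_send: s2n_send() failed
S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno
"""

for row in decode_s2n_errors(sample):
    print(row)
```

Under those assumptions, 335544366 is 5 * 2^26 + 46, so the certificate error
falls in the protocol category, while the later S2N_ERR_CLOSED and S2N_ERR_IO
codes fall in the closed and I/O categories. That would be consistent with the
untrusted-certificate handshake failure being the primary problem and the
send/recv errors being consequences of it.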


We did not add any specific configuration for the certgen plugin, because the
SLURM documentation seems to say it is optional(?).


I wonder what we are missing to get SLURM 25.05 running with TLS enabled and
PMIx working. Any advice is appreciated! Thanks!


--
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada






--
slurm-users mailing list -- [email protected] 
<mailto:[email protected]>
To unsubscribe send an email to [email protected] 
<mailto:[email protected]>




