Hi All,

We have updated SLURM to the current 25.05.x release and tried to enable TLS on it.
The OS is Alma 8.10, with cgroups v1 and PMIx v4.

We see that srun fails for MPI jobs across nodes with TLS-related errors
when using PMIx (the default), but it works with srun --mpi=pmi2 or with mpirun.

TLSType                 = tls/s2n
TLSParameters           = ca_cert_file= (has all the certs here under 
/etc/slurm/certs)

And the errors when using PMIx are:

[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] 
socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed 
S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error 
encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(a couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed 
S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in 
/builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed 
S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> 
Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(a couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data: 
[unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to 
proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was 
unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv 
slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV

[2025-09-25T11:07:36.335] [6451416.0] error:  mpi/pmix_v4: pmixp_p2p_send: n388 
[0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error:  mpi/pmix_v4: _slurm_send: n388 
[0]: pmixp_server.c:1586: Cannot send message to 
/var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist:
(null)
(and a couple more PMIx errors). It looks like the PMIx servers can no longer talk to their peers?
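
To make the error pattern easier to see, the s2n error names in log lines like the ones above can be tallied with a short script. This is only a minimal sketch: the function name `count_s2n_errors` and the regular expression are illustrative assumptions inferred from this excerpt, not anything provided by SLURM or s2n-tls.

```python
import re
from collections import Counter

# Matches tokens such as "S2N_ERR_CERT_UNTRUSTED[335544366]".
# The format is an assumption inferred from the log excerpt above.
S2N_ERROR = re.compile(r"\b(S2N_ERR_[A-Z0-9_]+)\[(\d+)\]")


def count_s2n_errors(log_text: str) -> Counter:
    """Count how often each s2n error name appears in a block of log text."""
    return Counter(name for name, _code in S2N_ERROR.findall(log_text))


if __name__ == "__main__":
    # Short sample copied from the excerpt above; replace it with real log text.
    sample = (
        "[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: "
        "s2n_negotiate() failed S2N_ERR_CERT_UNTRUSTED[335544366]: "
        "Certificate is untrusted\n"
        "[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: "
        "s2n_recv() failed S2N_ERR_CLOSED[134217728]: connection is closed\n"
    )
    for name, count in count_s2n_errors(sample).most_common():
        print(f"{name}: {count}")
```

If S2N_ERR_CERT_UNTRUSTED dominates and the other s2n errors only appear after it, that would suggest the closed-connection and I/O errors are consequences of the failed handshake rather than separate problems.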

There was no specific configuration for the certgen plugin, because the SLURM
documentation seems to say it is optional(?).

I wonder what we are missing to get SLURM 25.05 working with TLS enabled and
PMIx? Any advice is appreciated. Thanks!

-- 
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada


