Grigory,
You likely need to add your CA to the trust store on each node and then
update the store. Under Ubuntu, you would:
* Put your CA certificate file (PEM format, with a .crt extension) in
  /usr/local/share/ca-certificates/
* Run /usr/sbin/update-ca-certificates
This should create a .pem entry in /etc/ssl/certs for that CA, after which
the system will trust certs signed by it.
You will need to do that on all your systems that need to trust your CA.
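For reference, the two steps above could be scripted roughly as follows.
This is only a sketch: install_ca and my-ca.crt are hypothetical names, not
anything shipped with Slurm or the distribution. The Alma/RHEL 8 line is the
usual equivalent for that family (the original post mentions Alma 8.10), but
check the paths on your own systems.

```shell
#!/bin/sh
# Sketch only: copy a CA certificate into a trust-anchor directory and
# rebuild the system trust store. Function and file names are hypothetical.
install_ca() {
    ca_file=$1      # PEM-encoded CA certificate, e.g. my-ca.crt
    anchor_dir=$2   # trust-anchor directory for the distribution
    update_cmd=$3   # command that rebuilds the trust store
    cp "$ca_file" "$anchor_dir/" || return 1
    chmod 0644 "$anchor_dir/$(basename "$ca_file")" || return 1
    "$update_cmd"
}

# Ubuntu/Debian (run as root; the file needs a .crt extension):
#   install_ca my-ca.crt /usr/local/share/ca-certificates update-ca-certificates
# Alma/RHEL 8 (run as root):
#   install_ca my-ca.crt /etc/pki/ca-trust/source/anchors update-ca-trust
```

The function would have to be run on every node that needs to trust the CA.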
Brian Andrus
On 9/25/2025 11:11 AM, Grigory Shamov via slurm-users wrote:
Forgot to add: s2n-tls comes from EPEL and is version 1.5.10.
On 2025-09-25, 11:56 AM, "Grigory Shamov via slurm-users"
<[email protected]> wrote:
Hi All,
We have updated Slurm to the current 25.05.x and tried to enable TLS on it. The
OS is Alma 8.10, with cgroups v1 and PMIx v4.
We see that srun fails for MPI jobs across nodes, with TLS-related errors,
when using PMIx (the default), but succeeds with srun --mpi=pmi2 or with mpirun.
TLSType = tls/s2n
TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs)
The errors when using PMIx are:
[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)]
socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed
S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error
encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(a couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed
S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in
/builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed
S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno ->
Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(a couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data:
[unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to
proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was
unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv
slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388
[0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388
[0]: pmixp_server.c:1586: Cannot send message to
/var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist:
(null)
(and a couple more PMIx errors). It looks like the PMIx servers can no longer
talk to their peers.
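To see which s2n failure dominates in a log like the excerpt above, the
error names can be tallied with standard tools. A minimal sketch, assuming
the log is a plain text file; count_s2n_errors is a hypothetical helper name
and the log path in the usage comment is only an example location:

```shell
#!/bin/sh
# Sketch only: count occurrences of each S2N_ERR_* name in a log file,
# most frequent first. The function name is hypothetical.
count_s2n_errors() {
    grep -o 'S2N_ERR_[A-Z0-9_]*' "$1" | sort | uniq -c | sort -rn
}

# Example (path is an assumption, adjust to your SlurmdLogFile):
#   count_s2n_errors /var/log/slurm/slurmd.log
```

If S2N_ERR_CERT_UNTRUSTED comes out on top, the later S2N_ERR_CLOSED and
S2N_ERR_IO entries may simply be consequences of the failed handshake.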
We did not add any specific configuration for the certgen plugin, because the
Slurm documentation seems to say it is optional(?).
What are we missing to get Slurm 25.05 working with TLS enabled and PMIx?
Any advice appreciated! Thanks!
--
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]