Dear friends and colleagues,

On behalf of the Mellanox HPC R&D team, I would like to highlight functionality that we introduced in Slurm 17.11 and that has been shown [1] to significantly improve the speed and scalability of Slurm job start.
Starting from this release, the PMIx plugin supports:

(a) Direct point-to-point connections (Direct-connect) for Out-Of-Band (OOB) communications. Prior to 17.11 the plugin used the Slurm RPC mechanism, which is very convenient but has some performance-related issues. According to our measurements, Direct-connect significantly improves Slurm/PMIx performance in the direct-modex case [1]. This mode is turned on by default and uses a TCP-based implementation.

(b) If Slurm is configured with the UCX (http://www.openucx.org/) communication framework, the PMIx plugin uses a UCX-based implementation of Direct-connect.

(c) An "Early-wireup" option that pre-connects the Slurm step daemons before the application starts using the OOB channel.

We have tested the codebase extensively in-house, but broader testing is needed, and we look forward to hearing about your experience. The implementation has demonstrated good results at small scale [1]. We are currently working on obtaining larger-scale results and invite any interested parties to collaborate; please contact me at artemp at mellanox.com if you are interested.

For testing purposes you can use our recently released jobstart project, which we use internally for development: https://github.com/artpol84/jobstart. It provides a convenient way to deploy, as a regular user, a test Slurm instance inside an allocation obtained from the legacy Slurm installation managing the cluster. Another nice thing about this project is that it "bash-documents" the way we configure the HPC software stack, so it can also be used as a reference.

Some technical details about these features:

1. To build with the PMIx and UCX libraries, you need to explicitly configure with both:
   $ ./configure --with-pmix=<pmix-path> --with-ucx=<ucx-path>

2. You can select whether Direct-connect is enabled using the `SLURM_PMIX_DIRECT_CONN={true|false}` environment variable (envar) on a per-job-step basis. Direct-connect is on by default; the TCP-based implementation is used if Slurm wasn't configured with UCX.

3. If UCX support was enabled at configuration time, UCX is used by default for Direct-connect. You can control whether UCX is used through the `SLURM_PMIX_DIRECT_CONN_UCX={true|false}` envar. If UCX support wasn't enabled, this envar is ignored.

4. To enable UCX from the very first OOB communication, we added the Early-wireup option, which pre-connects the UCX-based communication tree in parallel with the local portion of MPI/OSHMEM application initialization. This feature is turned off by default and can be controlled with `SLURM_PMIX_DIRECT_CONN_EARLY={true|false}`. Once we are confident in this feature, we plan to turn it on by default.

5. You may also want to specify the UCX network device (e.g. UCX_NET_DEVICES=mlx5_0:1) and the transport (UCX_TLS=dc). For now we recommend the DC transport for job start; full RC support will be implemented soon. Currently you have to set the global envars (like UCX_TLS), but in the next release we will introduce prefixed envars (like UCX_SLURM_TLS and UCX_SLURM_NET_DEVICES) for finer-grained control over communication resource usage. An example job-step launch combining these envars is given at the end of this message.

In the presentation [1] you will also find two backup slides explaining how to enable the point-to-point and collective micro-benchmarks integrated into the PMIx plugin to get some basic reference numbers for the performance on your system. The jobstart project also contains a simple OSHMEM hello-world application that measures oshmem_init time.

[1] Slides presented at the Slurm booth at SC17: https://slurm.schedmd.com/SC17/Mellanox_Slurm_pmix_UCX_backend_v4.pdf
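To illustrate how the options above fit together, here is a sketch of a job-step launch with UCX-based Direct-connect and Early-wireup enabled. It assumes a Slurm build configured with both PMIx and UCX (item 1); the node/task counts and the ./my_mpi_app binary are placeholders, and --mpi=pmix is only needed if PMIx is not already your default MPI plugin:

   # Select the PMIx plugin OOB behavior for this job step
   $ export SLURM_PMIX_DIRECT_CONN=true        # Direct-connect (on by default)
   $ export SLURM_PMIX_DIRECT_CONN_UCX=true    # use the UCX-based implementation
   $ export SLURM_PMIX_DIRECT_CONN_EARLY=true  # pre-connect the communication tree
   # UCX transport and device selection (global envars for now, see item 5)
   $ export UCX_TLS=dc
   $ export UCX_NET_DEVICES=mlx5_0:1
   $ srun --mpi=pmix -N 2 -n 4 ./my_mpi_app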
----
Best regards,
Artem Y. Polyakov
Sr. Engineer SW, Mellanox Technologies Inc.