[slurm-users] Minimum CPU cores per node, partition-level configuration

2025-03-27 Thread Jeherul Islam via slurm-users
Dear All, I need to configure Slurm so that a user must request at least a certain minimum number of CPU cores in a particular partition (not system-wide); otherwise, the job must not run. Any suggestions will be highly appreciated. With Thanks and Regards -- Jeherul Islam
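
A minimal sketch of one common way to do this, using a partition QOS with a per-job minimum (this assumes accounting is set up and AccountingStorageEnforce covers QOS limits; the QOS name "min8cpu", the 8-core minimum, and the partition/node names are placeholders):

    # create a QOS that requires at least 8 CPUs per job (name and value are examples)
    sacctmgr add qos min8cpu
    sacctmgr modify qos where name=min8cpu set MinTRESPerJob=cpu=8 Flags=DenyOnLimit

    # slurm.conf: attach the QOS only to the partition that should enforce the minimum
    PartitionName=bigjobs Nodes=node[01-10] QOS=min8cpu State=UP

If QOS enforcement is not an option, a job_submit.lua plugin that rejects undersized jobs targeting that partition is the other obvious route.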

[slurm-users] bit_cache_init failure on the second time backup controller tries to take control

2025-03-27 Thread Safdar Iqbal via slurm-users
Hi, We're running into an issue where slurmctld core-dumps with the following error. This happens on the backup controller if it needs to take over from the primary _for a second time_: "slurmctld: fatal: bit_cache_init: cannot change size once set". Has anyone seen this error before? Also if the

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Pritchard Jr., Howard via slurm-users
Hi Matthias, If in fact you do need to build in pmix support in SLURM, remember to either use the --mpi=pmix option on the srun command line or set the SLURM_MPI_TYPE environment variable to pmix. You can actually build multiple variants of the pmix plugin, each using a different version of pmix, in case
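
For reference, the two ways of selecting the plugin look like this (the executable name and task counts are placeholders):

    # per-job, on the srun command line
    srun --mpi=pmix -N 2 -n 8 ./my_mpi_app

    # or once via the environment
    export SLURM_MPI_TYPE=pmix
    srun -N 2 -n 8 ./my_mpi_app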

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Pritchard Jr., Howard via slurm-users
Hi Matthias, Okay, this is useful, and the fact that mpi4py works outside of a container is good news. It might be worth trying to turn on debugging in the Slurm pmix plugin and see if that gives more info. Maybe set PMIxDebug in the mpi.conf file to 1 - https://slurm.schedmd.com/mpi.conf.
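
The change itself is a one-liner (sketch; the accepted values are described on the mpi.conf page linked above), followed by restarting the Slurm daemons so it is picked up:

    # mpi.conf (kept alongside slurm.conf)
    PMIxDebug=1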

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Pritchard Jr., Howard via slurm-users
Hi Matthias, It looks like the Open MPI in the containers was not built with PMI1 or PMI2 support, so it's defaulting to using PMIx. You are seeing this error message because the call within Open MPI 4.1.x's runtime system to PMIx_Init returned an error, namely that there was no PMIx server to co
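
A quick way to confirm what the containerized Open MPI was built with is to list its pmix components from inside the container, e.g.:

    # run inside the container image
    ompi_info | grep -i pmix

In Open MPI 4.1.x, PMI-1/PMI-2 support typically shows up there as the s1/s2 components; if those are absent, the launcher has to provide a PMIx server (i.e. srun --mpi=pmix).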

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Matthias Leopold via slurm-users
Hi Howard, thanks, but my Slurm 24.05 definitely has pmix support (visible in "srun --mpi=list") and it uses it through "MpiDefault=pmix" in slurm.conf. The mentioned problem also appears if I use a container with OpenMPI compiled against the same pmix as Slurm 24.05 (which is the Ubuntu 24.04 package

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Davide DelVento via slurm-users
{ "emoji": "♥️", "version": 1 } -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Davide DelVento via slurm-users
Hi Matthias, I see. It does not freak me out. Unfortunately I have very little experience working with MPI-in-containers, so I don't know the best way to debug this. What I do know is that some ABIs in Slurm change between major versions, and dependencies need to be recompiled with newer versions

[slurm-users] How to limit CPUs per node in a partition

2025-03-27 Thread Gestió Servidors via slurm-users
Hello, I have a testing partition with only one node. That server has 12 CPUs (it's a very old server: 2 sockets, 6 cores per socket, 1 thread per core). That partition, called "test.q", only has that node, so by default partition test.q has 12 CPUs (all from the testing node). However, now I would
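
If the goal is that jobs in test.q can never use more than a subset of the node's 12 CPUs, the partition-level MaxCPUsPerNode parameter is the usual knob; a sketch with a placeholder node name and an example cap of 6:

    # slurm.conf (node name and the cap of 6 are examples)
    NodeName=testnode CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
    PartitionName=test.q Nodes=testnode MaxCPUsPerNode=6 State=UP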

[slurm-users] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Matthias Leopold via slurm-users
Hi Davide, thanks for the reply. In my clusters OpenMPI is not present on the compute nodes. The application (nccl-tests) is compiled inside the container against OpenMPI. So when I run the same container in both clusters, it's effectively the exact same OpenMPI version. I hope you don't freak out