Dear All,
I need to configure Slurm so that users must request a certain minimum
number of CPU cores in a particular partition (not system-wide). Otherwise,
the job must not run.
Any suggestions will be highly appreciated.
With Thanks and Regards
--
Jeherul Islam
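One way to do this (an untested sketch, with example names and values): attach
a QOS to the partition and give that QOS a MinTRESPerJob limit, so jobs
requesting fewer CPUs in that partition are refused. This assumes accounting
is set up and AccountingStorageEnforce includes qos.

  # create a QOS requiring at least 4 CPUs per job; DenyOnLimit rejects
  # non-conforming jobs at submission instead of leaving them pending
  sacctmgr add qos mincpu4
  sacctmgr modify qos mincpu4 set MinTRESPerJob=cpu=4 Flags=DenyOnLimit

  # slurm.conf: attach the QOS only to the partition it should apply to
  PartitionName=bigjobs Nodes=node[01-04] QOS=mincpu4 State=UP
  AccountingStorageEnforce=limits,qos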
Hi,
We're running into an issue where slurmctld core-dumps with the
following error. This happens on the backup controller if it needs to
take over from the primary _for a second time_.
slurmctld: fatal: bit_cache_init: cannot change size once set
Has anyone seen this error before? Also if the
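For what it's worth, the failover path can be exercised on demand when trying
to reproduce this (assuming a standard primary/backup controller pair):

  # run on the backup controller: force it to take over from the primary
  scontrol takeover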
Hi Matthias,
If in fact you do need to build in PMIx support in Slurm, remember to either
use the --mpi=pmix option on the srun command line or set the SLURM_MPI_TYPE
env. variable to pmix.
You can actually build multiple variants of the pmix plugin, each using a
different version of PMIx, in case
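A minimal sketch of both options (the application name is hypothetical):

  # select the plugin explicitly per invocation
  srun --mpi=pmix ./mpi_app

  # or select it once via the environment
  export SLURM_MPI_TYPE=pmix
  srun ./mpi_app

  # if several variants are built, a specific one can be requested, e.g.
  srun --mpi=pmix_v4 ./mpi_app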
Hi Matthias,
Okay this is useful and the fact that the mpi4py works outside of a container
is good news.
It might be worth turning on debugging in the Slurm PMIx plugin to see
if that gives more info.
You can set PMIxDebug in the mpi.conf file to 1 -
https://slurm.schedmd.com/mpi.conf.
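For reference, that would look like this (mpi.conf lives in the same
directory as slurm.conf):

  # mpi.conf
  PMIxDebug=1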
Hi Matthias,
It looks like the Open MPI in the containers was not built with PMI1 or PMI2
support, so it's defaulting to using PMIx.
You are seeing this error message because the call within Open MPI 4.1.x's
runtime system to PMIx_Init returned an error,
namely that there was no PMIx server to connect to.
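A quick way to check what the container's Open MPI was built with, run inside
the container (ompi_info ships with Open MPI):

  # list the PMIx-related components compiled into Open MPI
  ompi_info | grep -i pmix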
Hi Howard,
thanks, but my Slurm 24.05 definitely has pmix support (visible in "srun
--mpi=list") and it uses it through "MpiDefault=pmix" in slurm.conf. The
mentioned problem also appears if I use a container with OpenMPI
compiled against same pmix as Slurm 24.05 (which is Ubuntu 24.04 package
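For anyone comparing the two sides, a rough checklist (pmix_info ships with
the PMIx package; the grep patterns are just examples):

  # host: confirm the plugin list and the configured default
  srun --mpi=list
  scontrol show config | grep -i MpiDefault

  # host and container: compare the PMIx library versions
  pmix_info | grep -i version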
Hi Matthias,
I see. It does not freak me out. Unfortunately I have very little
experience working with MPI-in-containers, so I don't know the best way to
debug this.
What I do know is that some ABIs in Slurm change with Slurm major versions,
and dependencies need to be recompiled with newer versions
Hello,
I have a testing partition with only one node. That server is very old and
has 12 CPUs (2 sockets, 6 cores per socket, 1 thread per core). That
partition, called "test.q", only has that node, so by default partition test.q
has 12 CPUs (all from the testing node). However, now I would
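For context, a node with that layout would be defined along these lines (the
node name is hypothetical; 2 sockets x 6 cores x 1 thread = 12 CPUs):

  # slurm.conf (illustrative)
  NodeName=testnode01 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
  PartitionName=test.q Nodes=testnode01 State=UP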
Hi Davide,
thanks for the reply.
In my clusters, OpenMPI is not present on the compute nodes. The
application (nccl-tests) is compiled inside the container against
OpenMPI. So when I run the same container in both clusters it's
effectively the exact same OpenMPI version. I hope you don't freak out