Hi,
I have been looking at this useful documentation:
https://slurm.schedmd.com/cpu_management.html
We have people complaining about memory performance issues while running
highly distributed jobs on a shared HPC cluster. After looking into it,
we saw that it is due either to concurrent memory access with other
users' jobs, or to memory accesses jumping from one socket to another
(NUMA architecture).
Until now, our "easy" answer has been to use the "exclusive" mode in
sbatch, to rule out any concurrent memory access, and to run numactl to
check whether the job ends up bound to the other socket.
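For context, our current workaround looks roughly like this (task counts
and the application name are just placeholders):

    #!/bin/bash
    #SBATCH --exclusive        # take the whole node so no other job touches its memory
    #SBATCH --ntasks=8         # illustrative values only
    #SBATCH --cpus-per-task=4

    numactl --hardware         # show the NUMA layout of the node
    numactl --show             # check which CPUs/NUMA nodes the job shell is bound to
    srun ./our_application     # placeholder for the real code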
However, I am looking for a better solution, as exclusive mode takes a
whole big node (we only have big nodes...), and numactl only tells us
whether we will get poor performance; it does not prevent it.
I checked the slurm.conf parameters, but everything I found is about
changing SelectType. However, I cannot use another SelectType since we
have GPUs, so we are on cons_tres. Moreover, we already have a very
complicated Slurm configuration, and I would like to avoid any side
effects.
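For reference, the relevant part of our configuration is roughly the
following (simplified; the SelectTypeParameters value is only an example,
ours has more options):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory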
I am thinking about pointing our users to the `--cpu-bind` option
(https://slurm.schedmd.com/srun.html#OPT_cpu-bind), but I am not sure how
to use it, and it seems to be limited to srun...?
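For instance, I guess the intended usage is something like the following
(the task counts and binding choices are just my guess):

    srun --ntasks=8 --cpu-bind=verbose,cores ./our_application
    # or, binding tasks to sockets and their memory to the local NUMA node:
    srun --ntasks=2 --cpu-bind=sockets --mem-bind=local ./our_application

Is that the right way to use it, and can users rely on it from inside an
sbatch script?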
From the job's point of view, I know there are tools to deal with this,
like likwid or placement (though not entirely?), but they do not seem
easy to use, and a Slurm-level solution would be more suitable and more
generic.
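(For example, as far as I understand, likwid-pin would be used roughly
like this, with the core lists purely illustrative:

    likwid-pin -c 0-7 ./our_application        # pin threads to cores 0-7
    likwid-pin -c S0:0-7 ./our_application     # or pin within socket 0

but that means every user has to learn yet another tool.)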
Do you have any ideas on how to deal with these issues?
Thanks,
Best regards,
Rémy Dernat
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]