Hi,

I have been looking at this useful documentation: https://slurm.schedmd.com/cpu_management.html

We have people complaining about memory performance issues while running highly distributed jobs in a shared HPC cluster environment. After looking into it, we found that this is caused either by concurrent memory access with other users' jobs, or by memory accesses that cross from one socket to another (on a NUMA architecture).

Until now, our "easy" answer has been to use the --exclusive option with sbatch, to avoid any concurrent memory access, and to use numactl to check whether the job is bound to the other socket.
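
For context, our current workaround looks roughly like this in a job script (the task count and the application name are just placeholders):

    #!/bin/bash
    #SBATCH --exclusive            # take the whole node so memory bandwidth is not shared
    #SBATCH --ntasks=16
    srun numactl --show            # report the CPU/memory binding each task actually got
    srun ./my_app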

However, I am looking for a better solution, since exclusive mode takes over an entire large node (we only have large nodes...), and numactl only tells us whether we will get poor performance or not.

I checked the slurm.conf parameters, but everything I found was about changing SelectType. However, I cannot use another SelectType because we have GPUs, so we are using cons_tres. Moreover, we already have a very complicated Slurm configuration, and I would like to avoid any side effects.
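
For reference, our select configuration is essentially the standard cons_tres setup, roughly like this (the SelectTypeParameters value is only an illustration, not necessarily our exact setting):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory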

I am thinking about telling our users about the `--cpu-bind` option (https://slurm.schedmd.com/srun.html#OPT_cpu-bind), but I am not sure how to use it, and it seems to be limited to srun...?
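
To make the question concrete, I imagine it would be used something like this inside a job script (the application name is just a placeholder):

    srun --cpu-bind=verbose,cores ./my_app      # bind each task to its allocated cores and report the binding
    srun --cpu-bind=verbose,sockets ./my_app    # or bind tasks to whole sockets instead

but I do not know how that would help users who submit plain sbatch scripts without srun.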

From a job point of view, I know there are some tools to deal with this, such as likwid or placement (though perhaps not entirely?), but they do not seem easy to use, and a Slurm-level solution would be more suitable and more generic.
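
For example, as far as I understand likwid, pinning a job would look something like this (the core list and the application name are just placeholders):

    likwid-pin -c S0:0-7 ./my_app    # pin the application threads to cores 0-7 of socket 0

but every user would then have to work out the right core list for their own job, which is what I would like to avoid.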

Do you have any ideas on how to deal with these issues?

Thanks,

Best regards,

Rémy Dernat
