Dear Colleagues,

for quite some years now we have been facing issues on our clusters, again and
again, with so-called job-farming (or task-farming) concepts in Slurm jobs
using srun. And it bothers me that we can hardly help users with requests in
this regard.


From the documentation (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), 
it reads like this:

------------------------------------------->

...

#SBATCH --nodes=??

...

srun -N 1 -n 2 ... prog1 &> log.1 &

srun -N 1 -n 1 ... prog2 &> log.2 &

...

wait

------------------------------------------->

should do it, meaning that the corresponding job steps are created and 
reasonably placed onto the resources/slots available in the job allocation.


Well, this does not really work on our clusters. (I'm afraid I'm just too 
dense to use srun here ...)

As long as complete nodes are used, with a regular tasks-per-node/cpus-per-task 
pattern, everything is still manageable, although task and thread placement 
with srun is sometimes still a burden.


But if I want freer resource specifications, as in the example below 
(half a node or so), I simply fail to get the desired result.


Ok. We have Haswell nodes with 2 sockets, and each socket has 2 NUMA domains 
with 7 CPUs each: 28 physical cores in total, 56 with Hyperthreading, such 
that the logical CPUs are as follows.

socket   phys. CPU   logical CPUs
  0          0            0,28
  0          1            1,29
  0          2            2,30
  ...
  1          0           14,42
  ...
  1         13           27,55

(slurm.conf is attached ... the essential part is the "cm2_inter" partition of 
the "inter" cluster)


So, for instance, for an OpenMP-only program, I'd like to place one step with 
14 OMP threads on the 1st socket, another step with 14 OMP threads on the 2nd 
socket (of the first node), and so on.
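One direction I have been eyeing is binding each step explicitly via srun's 
--cpu_bind=mask_cpu. The per-socket masks for the topology above would be as 
follows (just a sketch of the arithmetic, assuming the logical-CPU numbering 
shown: socket 0 owns CPUs 0-13 and 28-41, socket 1 owns 14-27 and 42-55):

```shell
# Per-socket CPU affinity masks for the topology above (a sketch; assumes the
# logical-CPU numbering shown in the table).
mask0=$(( (0x3FFF << 28) | 0x3FFF ))            # socket 0: bits 0-13 and 28-41
mask1=$(( (0x3FFF << 42) | (0x3FFF << 14) ))    # socket 1: bits 14-27 and 42-55
printf 'socket0 mask: 0x%X\n' "$mask0"          # -> socket0 mask: 0x3FFF0003FFF
printf 'socket1 mask: 0x%X\n' "$mask1"          # -> socket1 mask: 0xFFFC000FFFC000
```

These could then in principle be handed to a step as, e.g., 
srun --cpu_bind=verbose,mask_cpu:0x3FFF0003FFF for the first-socket step, 
though I have not had success with manual masks either.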

------------------------------------------->

#!/bin/bash
#SBATCH -J jobfarm_test
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D ./
#SBATCH --mail-type=NONE
#SBATCH --time=00:10:00
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=2
module load slurm_setup       # specific to LRZ cluster

export OMP_NUM_THREADS=14
placement=/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only

# one step = 1 task with 14 OMP threads on 14 cores (2 hyperthreads each)
srun_opts="-N 1 -n 1 -c $((2*OMP_NUM_THREADS)) --mpi=none --exclusive --export=ALL --cpu_bind=verbose,cores"

# launch 4 steps in the background (intended: one per socket) and wait for all
for i in $(seq 1 4); do
   srun $srun_opts $placement -d 10 &> log.$i &
done
wait

------------------------------------------->

placement-test.omp_only is just an OpenMP executable in which each thread reads 
from /proc/... its tid and the CPU it is running on, prints that to the screen, 
and then sleeps in order to remain on that CPU in the running state for a while.
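For anyone without access to that tool, a crude stand-in could look like the 
following (my sketch, Linux-only: it relies on field 39 of /proc/self/stat 
being the CPU a process last ran on, and uses background processes instead of 
OMP threads):

```shell
# Crude stand-in for placement-test (assumption: Linux procfs, where field 39
# of /proc/self/stat is the CPU the process last ran on). Spawns 4 workers
# that each report their CPU, roughly what the real tool prints per thread.
for t in 1 2 3 4; do
  (
    cpu=$(awk '{print $39}' /proc/self/stat)   # awk inspects its own stat file
    echo "worker $t running on CPU $cpu"
    sleep 1                                    # persist briefly, like -d above
  ) &
done
wait
```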


With the script above, I assumed that all 4 srun steps would run at the same 
time, one per socket. But they don't.


First of all, due to Hyperthreading, I must specify "-c 56" here. If I use 
"-c 28" (which would be more intuitive to me), CPUs 0-6,28-32 are used (the 
first NUMA domain). And even with -c 28, or -c 14, the steps don't run at the 
same time on a node: only a single step per node at a time.


Removing "--exclusive" doesn't change anything. --cpu_bind to sockets doesn't 
have an effect either (here I'm already shooting in the dark).
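One thing still on my list to try (an assumption on my side: if I read the 
release notes correctly, this flag only exists in newer Slurm releases, I 
believe since 21.08) is the step flag --exact, which is supposed to restrict a 
step to exactly the resources it requests rather than consuming whole nodes:

```shell
# Untested variant (newer Slurm only): --exact instead of --exclusive,
# hoping that steps then share a node instead of serializing.
srun -N 1 -n 1 -c 28 --exact --mpi=none --cpu_bind=verbose,cores ./prog &> log.1 &
```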


I want to avoid more stringent requirements (such as memory), in order to 
allow sharing of the complete available memory on a node. But even reducing 
the memory requirement per CPU (to a ridiculous 1K) does not change anything.



So I'm definitely doing something wrong. But I can't even guess what. I tried 
really many options, including -m and -B, without success. The complexity is 
killing me here.

From the SchedMD documentation, I assume it shouldn't be so complicated that 
one has to resort to the low-level placement options described at 
https://slurm.schedmd.com/mc_support.html



Does anyone have a clue on how to use srun for these purposes?

If not, I would also be glad to learn about alternatives ... provided they are 
as convenient as SchedMD promises at 
https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES

Then I could get rid of srun ... maybe. (In my desperation, I even tried GNU 
parallel with SSH process spawning ... :( Verdict: it is not really convenient 
for this purpose.)


Thank you very much in advance!

Kind regards,

Martin



Attachment: slurm.conf