Hi Durai,
I see the same thing as you on our test-cluster that has
ThreadsPerCore=2
configured in slurm.conf
The double-foo goes away with this:
srun --cpus-per-task=1 --hint=nomultithread echo foo
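I haven't dug into why, but you can watch the task count directly. A
quick check (just a sketch; the partition option from your examples is
omitted here):

srun --cpus-per-task=1 --hint=nomultithread \
    bash -c 'echo "NTASKS=$SLURM_NTASKS BIND=$SLURM_CPU_BIND_LIST"'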
Having multithreading enabled leads, imho, to surprising behaviour in
Slurm. My impression is that it makes the concept of "a CPU" in Slurm
somewhat fuzzy. It becomes ambiguous what you get when using the
cpu-related options of srun, sbatch and salloc: is it a CPU-core or a
CPU-thread?
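One way to probe that on a given cluster (again only a sketch; nproc
honours the step's cpu binding, so whether it prints 1 or 2 tells you
whether the one requested "CPU" ended up as a thread or a whole core):

srun -n1 -c1 nproc
srun -n1 -c1 --hint=nomultithread nproc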
I think what you found is a bug.
If you run
for c in {4..1}
do
echo "## $c ###"
srun -c $c bash -c 'echo $SLURM_CPU_BIND_LIST'
done
you will get:
## 4 ###
0x003003
## 3 ###
0x003003
## 2 ###
0x001001
## 1 ###
0x000001,0x001000
0x000001,0x001000
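In case it helps with reading those masks: each bit is one CPU-thread
as numbered by the OS. A throwaway helper to list the set bits (plain
bash; the 72 covers the Procs count from your node definition, adjust
to your thread count):

decode() {
    local mask=$(( $1 )) i
    for (( i = 0; i < 72; i++ )); do
        (( (mask >> i) & 1 )) && printf '%d ' "$i"
    done
    echo
}
decode 0x003003    # -> 0 1 12 13: two cores, two threads each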
You see: requesting 4 and 3 CPUs results in the same cpu-binding, as
both need two CPU-cores with 2 threads each. In the "3" case one of the
four threads stays unused but of course is not free for another job.
In the "1" case I would expect to see the same binding as in the "2"
case. If you combine the two values in the list you *do* get the same
value but obviously it's a list of two values and this might be the
origin of the problem.
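The OR is easy to verify in the shell:

printf '0x%06x\n' $(( 0x000001 | 0x001000 ))    # 0x001001, the "2" binding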
It is probably related to what is mentioned in the documentation for
'--ntasks':
"[...] The default is one task per node, but note that the
--cpus-per-task option will change this default."
That would also explain why explicitly passing -n1, as in your last
example below, makes the duplicate output go away.
Regards
Hermann
On 3/24/22 1:37 PM, Durai Arasan wrote:
Hello Slurm users,
We are experiencing strange behavior where srun executes commands
twice, but only when setting --cpus-per-task=1:
$ srun --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298286 queued and waiting for resources
srun: job 1298286 has been allocated resources
foo
foo
This is not seen when --cpus-per-task has any other value:
$ srun --cpus-per-task=3 --partition=gpu-2080ti echo foo
srun: job 1298287 queued and waiting for resources
srun: job 1298287 has been allocated resources
foo
It is also not seen when --ntasks is specified explicitly:
$ srun -n1 --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298288 queued and waiting for resources
srun: job 1298288 has been allocated resources
foo
Relevant slurm.conf settings are:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# example node configuration
NodeName=slurm-bm-58 NodeAddr=xxx.xxx.xxx.xxx Procs=72 Sockets=2
CoresPerSocket=18 ThreadsPerCore=2 RealMemory=354566
Gres=gpu:rtx2080ti:8 Feature=xx_v2.38 State=UNKNOWN
On closer inspection of the job's environment variables in the
"--cpus-per-task=1" case, the following variables have wrongly acquired
a value of 2:
SLURM_NTASKS=2
SLURM_NPROCS=2
SLURM_TASKS_PER_NODE=2
SLURM_STEP_NUM_TASKS=2
SLURM_STEP_TASKS_PER_NODE=2
Can you see what could be wrong?
Best,
Durai