Hi fellow slurm users - I've been using Slurm happily for a few months, but now I feel like it's gone crazy, and I'm wondering if anyone can explain what's going on. I have a trivial batch script that I submit multiple times, and it ends up with a different number of nodes allocated each time. Does anyone have any idea why?
Here's the output:

tin 2028 : cat t
#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600

tin 2029 : sbatch t
Submitted batch job 407758
tin 2030 : sbatch t
Submitted batch job 407759
tin 2030 : sbatch t
Submitted batch job 407760
tin 2030 : squeue -l -u bernstei
Wed Mar 27 17:30:51 2019
  JOBID PARTITION  NAME     USER    STATE   TIME TIME_LIMI NODES NODELIST(REASON)
 407760     n2019     t bernstei  RUNNING   0:03     10:00     3 compute-4-[16-18]
 407758     n2019     t bernstei  RUNNING   0:06     10:00     2 compute-4-[29-30]
 407759     n2019     t bernstei  RUNNING   0:06     10:00     2 compute-4-[21,28]

All the compute-4-* nodes have 36 physical cores (72 hyperthreads). If I look at the SLURM_* variables, all the jobs show

SLURM_NPROCS=72
SLURM_NTASKS=72
SLURM_CPUS_ON_NODE=72
SLURM_NTASKS_PER_CORE=1

but for some reason the job that ends up on 3 nodes, and only that one, shows

SLURM_JOB_CPUS_PER_NODE=72(x3)
SLURM_TASKS_PER_NODE=24(x3)

while the others show the expected

SLURM_JOB_CPUS_PER_NODE=72(x2)
SLURM_TASKS_PER_NODE=36(x2)

I'm using CentOS 7 (via NPACI Rocks) and Slurm 18.08.0 via the Rocks slurm roll.

thanks,
Noam
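
P.S. In case it helps: below is a sketch of the same script with the node geometry spelled out explicitly, using the standard sbatch options --nodes and --ntasks-per-node (the values are just what I would expect for these 36-core nodes; I have not actually run this variant). Pinning the layout this way would presumably sidestep the symptom, but I'd still like to understand why the scheduler sometimes decides on 3 nodes by itself.

#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --nodes=2              # make the node count explicit (assumed workaround, untested)
#SBATCH --ntasks-per-node=36   # one task per physical core on a 36-core node
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600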