Hi fellow slurm users - I've been using Slurm happily for a few months, but now I feel like it's gone crazy, and I'm wondering if anyone can explain what's going on. I have a trivial batch script that I submit multiple times, and it ends up with a different number of nodes allocated each time. Does anyone have any idea why?
Here's the output:

tin 2028 : cat t
#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600

tin 2029 : sbatch t
Submitted batch job 407758
tin 2030 : sbatch t
Submitted batch job 407759
tin 2030 : sbatch t
Submitted batch job 407760
tin 2030 : squeue -l -u bernstei
Wed Mar 27 17:30:51 2019
  JOBID PARTITION  NAME     USER    STATE   TIME TIME_LIMI NODES NODELIST(REASON)
 407760     n2019     t bernstei  RUNNING   0:03     10:00     3 compute-4-[16-18]
 407758     n2019     t bernstei  RUNNING   0:06     10:00     2 compute-4-[29-30]
 407759     n2019     t bernstei  RUNNING   0:06     10:00     2 compute-4-[21,28]

All the compute-4-* nodes have 36 physical cores (72 hyperthreads). If I look at the SLURM_* variables, all the jobs show

SLURM_NPROCS=72
SLURM_NTASKS=72
SLURM_CPUS_ON_NODE=72
SLURM_NTASKS_PER_CORE=1

but for some reason the job that ends up on 3 nodes, and only that one, shows

SLURM_JOB_CPUS_PER_NODE=72(x3)
SLURM_TASKS_PER_NODE=24(x3)

while the others show the expected

SLURM_JOB_CPUS_PER_NODE=72(x2)
SLURM_TASKS_PER_NODE=36(x2)

I'm using CentOS 7 (via NPACI Rocks) and Slurm 18.08.0 via the Rocks slurm roll.

thanks,
Noam
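
P.S. In case it helps: below is a sketch of the same script with the node geometry spelled out explicitly, using the standard sbatch options --nodes and --ntasks-per-node (the values are just what I would expect for these 36-core nodes; I have not actually run this variant). Pinning the layout this way would presumably sidestep the symptom, but I'd still like to understand why the scheduler sometimes decides on 3 nodes by itself.

#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --nodes=2              # make the node count explicit (assumed workaround, untested)
#SBATCH --ntasks-per-node=36   # one task per physical core on a 36-core node
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600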