I have a user who is submitting a job to Slurm that requests 16 tasks, i.e.

    #SBATCH --ntasks 16
    #SBATCH --cpus-per-task 1

The Slurm script runs an MPI program called Parent.mpi, which then tries (and fails) to launch 15 MPI child processes. He's tried two different ways for the parent to spawn the children:

1. A system() call, such as
       system("srun --ntasks=4 mpirun -np 4 ./child.mpi")
   or
       system("mpirun -np 4 ./child.mpi")
2. MPI_Comm_spawn

Both ways generate the following in the Slurm output file:

    srun: Job ### step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: error: Unable to create step for job ###: Job/step already completing or completed

So, basically, he's requesting 16 tasks, one of which is used by the parent, and the other 15 are supposed to be used by the children, but the children can't use those 15 because...well, I'm not sure why.

Is there something I need to change in slurm.conf to allow this to work?

---
Mike VanHorn
Senior Computer Systems Administrator
College of Engineering and Computer Science
Wright State University
265 Russ Engineering Center
937-775-5157
michael.vanh...@wright.edu