Thanks for the replies. I didn't mention it earlier, but we're using Intel MPI, and setting the following environment variable, I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, fixed my issue:
#SBATCH --ntasks=980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
mpirun -np $SLURM_NTASKS -perhost $SLURM_NTASKS_PER_NODE /path/to/MPI/app

Thanks,
- Chansup

On Wed, Jul 31, 2019 at 2:01 AM Daniel Letai <d...@letai.org.il> wrote:
>
> On 7/30/19 6:03 PM, Brian Andrus wrote:
>
> I think this may be more about how you are calling mpirun and the mapping
> of processes.
>
> With the "--exclusive" option, the processes are given access to all the
> cores on each box, so mpirun has a choice. IIRC, the default is to pack
> them by slot, so fill one node and then move to the next, whereas you want
> to map by node (one process per node, cycling by node).
>
> From the man page for mpirun (Open MPI):
> *--map-by <foo>* Map to the specified object, defaults to *socket*.
> Supported options include slot, hwthread, core, L1cache, L2cache, L3cache,
> socket, numa, board, node, sequential, distance, and ppr. Any object can
> include modifiers by adding a : and any combination of PE=n (bind n
> processing elements to each proc), SPAN (load balance the processes across
> the allocation), OVERSUBSCRIBE (allow more processes on a node than
> processing elements), and NOOVERSUBSCRIBE. This includes PPR, where the
> pattern would be terminated by another colon to separate it from the
> modifiers.
>
> So adding "--map-by node" would give you what you are looking for.
> Of course, this syntax is for Open MPI's mpirun command, so YMMV.
>
> If using srun (as recommended) instead of invoking mpirun directly, you
> can still achieve the same functionality using exported environment
> variables as per the mpirun man page, like this:
>
> OMPI_MCA_rmaps_base_mapping_policy=node srun --export
> OMPI_MCA_rmaps_base_mapping_policy ...
>
> in your sbatch script.
>
> Brian Andrus
>
>
> On 7/30/2019 5:14 AM, CB wrote:
>
> Hi Everyone,
>
> I've recently discovered that when an MPI job is submitted with the
> --exclusive flag, Slurm fills up each node even if the --ntasks-per-node
> flag is used to set how many MPI processes are scheduled on each node.
> Without the --exclusive flag, Slurm works as expected.
>
> Our system is running Slurm 17.11.7.
>
> The following options work as intended: each node gets 16 MPI processes
> until all 980 MPI processes are scheduled, using a total of 62 compute
> nodes. Each of the first 61 nodes runs 16 MPI processes and the last node
> runs 4, for 980 MPI processes in total.
>
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
>
> However, if the --exclusive option is added, Slurm fills up each node with
> 28 MPI processes (each compute node has 28 cores). Interestingly, Slurm
> still allocates 62 compute nodes although only 35 of them are actually
> used to distribute the 980 MPI processes.
>
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
> #SBATCH --exclusive
>
> Has anyone seen this behavior?
>
> Thanks,
> - Chansup
>
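
For anyone hitting the same packing behavior with Open MPI instead of Intel
MPI, a minimal sketch of Brian's --map-by suggestion might look like the
following; the application path is a placeholder and the script is untested,
so treat it as a starting point rather than a verified fix:

#!/bin/bash
#SBATCH --ntasks=980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

# --map-by node asks Open MPI's mpirun to place ranks round-robin across the
# allocated nodes instead of packing each node by slot before moving on
mpirun -np $SLURM_NTASKS --map-by node /path/to/MPI/app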