Antony,

I’m not sure I understand your answer. I want to launch 2 tasks (managers),
one per node, but reserve the rest of the cores on each node so that the
original 2 managers can spawn new workers on them. Requesting 24 tasks would
create 24 managers, I think.
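
For reference, something along these lines is roughly the intent I'm trying to
express, i.e. pin one task to each of 2 nodes and keep the whole node for it
(untested here, just a sketch using standard salloc options):

salloc --nodes=2 --ntasks-per-node=1 --cpus-per-task=24 --verbose runscript.bash …

or alternatively --nodes=2 --ntasks-per-node=1 --exclusive, to claim whole
nodes regardless of the core count.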

Kurt

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Antony 
Cleave
Sent: Tuesday, December 28, 2021 6:15 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [EXTERNAL] Re: [slurm-users] Slurm and MPICH don't play well together 
(salloc)


Hi

I've not used mpich for years but I think I see the problem. By asking for 24
CPUs per task and specifying 2 tasks, you are asking Slurm to allocate 48 CPUs
per node.

Your nodes have 24 CPUs in total, so you don't have any nodes that can service
this request.

Try asking for 24 tasks. I've only ever used cpus-per-task for hybrid MPI/OpenMP
codes with 2 MPI tasks and 12 threads per task.
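
To be concrete, I mean something along the lines of (not tested against your
setup):

salloc --ntasks=24 --verbose runscript.bash

and for the hybrid case I mentioned, something like --ntasks=2 --cpus-per-task=12,
so the per-node total stays within the 24 CPUs you have.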

Antony

On Tue, 28 Dec 2021, 23:02 Mccall, Kurt E. (MSFC-EV41),
<kurt.e.mcc...@nasa.gov> wrote:
Hi,

My MPICH jobs are being launched and the desired number of processes are
created, but when one of those processes tries to spawn a new process using
MPI_Comm_spawn(), that process just spins in the polling code deep within the
MPICH library. See the Slurm error message below. This all works without
problems on other clusters that have Torque as the process manager. We are
using Slurm 20.02.3 on Red Hat (kernel 4.18.0), and MPICH 4.0b1.

salloc: defined options
salloc: -------------------- --------------------
salloc: cpus-per-task       : 24
salloc: ntasks              : 2
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: Linear node selection plugin loaded with argument 4
salloc: select/cons_res loaded with argument 4
salloc: Cray/Aries node selection plugin loaded
salloc: select/cons_tres loaded with argument 4
salloc: Granted job allocation 34330
srun: error: Unable to create step for job 34330: Requested node configuration
is not available

I’m wondering if the salloc command I am using is correct. I intend for it to
launch 2 processes, one per node, but reserve 24 cores on each node for the 2
launched processes to spawn new processes using MPI_Comm_spawn. Could the
reservation of all 24 cores make Slurm or MPICH think that there are no more
cores available?

salloc --ntasks=2 --cpus-per-task=24 --verbose runscript.bash …


I think that our cluster’s compute nodes are configured correctly:

$ scontrol show node=n001

NodeName=n001 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n001 NodeHostName=n001 Version=20.02.3
   OS=Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021
   RealMemory=128351 AllocMem=0 FreeMem=126160 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=normal,low,high
   BootTime=2021-12-21T14:25:05 SlurmdStartTime=2021-12-21T14:25:52
   CfgTRES=cpu=24,mem=128351M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Thanks for any help.
