Dear Slurm community,

I am quite new to Slurm, but I have a small cluster with 3 compute nodes running. I can run simple jobs like `srun -N3 hostname`, and I am now trying to run an MPI hello-world app. My problem is that the job hangs and is killed after a few seconds.
```
# srun -N2 -n4 /scratch/helloworld-mpi
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2.0 ON nid001001-cluster-1 CANCELLED AT 2022-03-22T16:43:07 ***
srun: error: nid001002-cluster-1: task 3: Killed
srun: launch/slurm: _step_signal: Terminating StepId=2.0
srun: error: nid001001-cluster-1: tasks 0-2: Killed
```

I can see this in the slurmd logs (the same launch sequence appears twice, so I am quoting it once):

```
slurmd: debug3: CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=515174 TmpDisk=211436 Uptime=1920011 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task StepId=2.0 request from UID:0 GID:0 HOST:172.29.113.47 PORT:40062
slurmd: debug: Checking credential with 468 bytes of sig data
slurmd: debug2: _group_cache_lookup_internal: no entry found for root
slurmd: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug: task/affinity: lllp_distribution: binding tasks:3 to nodes:0 sockets:3:0 cores:3:0 threads:3
slurmd: task/affinity: lllp_distribution: JobId=2 implicit auto binding: sockets,one_thread, dist 8192
slurmd: debug2: task/affinity: lllp_distribution: JobId=2 will use lllp_cyclic because of SelectTypeParameters
slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000000000002
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000004
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000100000000
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000002
slurmd: debug3: task/affinity: _lllp_generate_cpu_bind: 3 19 58
slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [2]: mask_cpu,one_thread, 0x0000000000000001,0x0000000100000000,0x0000000000000002
slurmd: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x0000000000000001,0x0000000100000000,0x0000000000000002)
slurmd: debug2: _insert_job_state: we already have a job state for job 2. No big deal, just an FYI.
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmd: debug3: slurmstepd rank 0 (nid001001-cluster-1), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_FORWARD_DATA
slurmd: debug2: Processing RPC: REQUEST_FORWARD_DATA
slurmd: debug3: Entering _rpc_forward_data, address: /var/spool/slurmd/sock.pmi2.2.0, len: 84
slurmd: debug2: failed connecting to specified socket '/var/spool/slurmd/sock.pmi2.2.0': Connection refused
...
```

I compiled MPICH myself, and I can run MPI jobs outside of Slurm:

```
# mpirun -ppn 2 --hosts nid001001-cluster-1,nid001003-cluster-1,nid001003-cluster-1 /scratch/helloworld-mpi
Warning: Permanently added 'nid001003-cluster-1,172.29.9.83' (ECDSA) to the list of known hosts.
Hello world from processor nid001003-cluster-1, rank 2 out of 4 processors
Hello world from processor nid001003-cluster-1, rank 3 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 0 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 1 out of 4 processors
```

Could someone please give me a hint about what to look at regarding running MPI jobs under Slurm? Thank you very much.
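P.P.S. I am also wondering whether this could be a configuration issue on my side. As far as I understand, the PMI flavor that srun speaks (pmi2 here, judging by the mpi/pmi2 messages) has to match what the MPI library was built against. Below is what I would expect in slurm.conf; note that MpiDefault=pmi2 is my assumption about the right value, not something I have verified on my cluster:

```
# slurm.conf (fragment)
# MpiDefault selects the plugin srun uses when --mpi is not given;
# pmi2 is an assumption here, to match an MPICH built with Slurm's PMI2 support.
MpiDefault=pmi2
```

Alternatively, I believe the plugin can be forced per job with `srun --mpi=pmi2 ...`, and the supported plugins listed with `srun --mpi=list`.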
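P.S. To convince myself of what the "Connection refused" in the slurmd log means, I wrote the small probe below (my own throwaway script, not part of Slurm; the socket path is the one from my log). It distinguishes "the socket file was never created" from "the file exists but nothing is accepting on it", which is what ECONNREFUSED on a Unix-domain socket indicates:

```python
import os
import socket

def probe(path):
    """Return a one-word diagnosis of a Unix-domain socket path."""
    if not os.path.exists(path):
        return "missing"      # the socket file was never created
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return "listening"    # something is accepting on the socket
    except ConnectionRefusedError:
        # File exists, but no process is listening -- the same
        # ECONNREFUSED that the slurmd log reports.
        return "refused"
    finally:
        s.close()

if __name__ == "__main__":
    # Socket path taken from my slurmd log (sock.pmi2.<jobid>.<stepid>)
    print(probe("/var/spool/slurmd/sock.pmi2.2.0"))
```

(On a machine where the socket file does not exist it simply prints "missing".)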