Dear Slurm community,

I am quite new to Slurm, but I have a small cluster with 3 compute nodes running. I can run simple jobs like `srun -N3 hostname`, and I am now trying to run an MPI hello-world app. My problem is that the job hangs and is killed after a few seconds.
```
# srun -N2 -n4 /scratch/helloworld-mpi
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2.0 ON nid001001-cluster-1 CANCELLED AT 2022-03-22T16:43:07 ***
srun: error: nid001002-cluster-1: task 3: Killed
srun: launch/slurm: _step_signal: Terminating StepId=2.0
srun: error: nid001001-cluster-1: tasks 0-2: Killed
```

I can see this in the slurmd logs (the same launch sequence appears twice, so I am quoting it once):

```
slurmd: debug3: CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=515174 TmpDisk=211436 Uptime=1920011 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task StepId=2.0 request from UID:0 GID:0 HOST:172.29.113.47 PORT:40062
slurmd: debug: Checking credential with 468 bytes of sig data
slurmd: debug2: _group_cache_lookup_internal: no entry found for root
slurmd: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug: task/affinity: lllp_distribution: binding tasks:3 to nodes:0 sockets:3:0 cores:3:0 threads:3
slurmd: task/affinity: lllp_distribution: JobId=2 implicit auto binding: sockets,one_thread, dist 8192
slurmd: debug2: task/affinity: lllp_distribution: JobId=2 will use lllp_cyclic because of SelectTypeParameters
slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000000000002
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000004
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000100000000
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000002
slurmd: debug3: task/affinity: _lllp_generate_cpu_bind: 3 19 58
slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [2]: mask_cpu,one_thread, 0x0000000000000001,0x0000000100000000,0x0000000000000002
slurmd: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x0000000000000001,0x0000000100000000,0x0000000000000002)
slurmd: debug2: _insert_job_state: we already have a job state for job 2. No big deal, just an FYI.
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmd: debug3: slurmstepd rank 0 (nid001001-cluster-1), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_FORWARD_DATA
slurmd: debug2: Processing RPC: REQUEST_FORWARD_DATA
slurmd: debug3: Entering _rpc_forward_data, address: /var/spool/slurmd/sock.pmi2.2.0, len: 84
slurmd: debug2: failed connecting to specified socket '/var/spool/slurmd/sock.pmi2.2.0': Connection refused
...
```

I compiled MPICH myself, and I can run MPI jobs outside of Slurm:

```
# mpirun -ppn 2 --hosts nid001001-cluster-1,nid001003-cluster-1,nid001003-cluster-1 /scratch/helloworld-mpi
Warning: Permanently added 'nid001003-cluster-1,172.29.9.83' (ECDSA) to the list of known hosts.
Hello world from processor nid001003-cluster-1, rank 2 out of 4 processors
Hello world from processor nid001003-cluster-1, rank 3 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 0 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 1 out of 4 processors
```

Could someone please give me a hint about what to look at regarding running MPI jobs under Slurm? Thank you very much.
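P.P.S. I am also wondering whether this could be a configuration issue on my side. As far as I understand, the PMI flavor that srun speaks (pmi2 here, judging by the mpi/pmi2 messages) has to match what the MPI library was built against. Below is what I would expect in slurm.conf; note that MpiDefault=pmi2 is my assumption about the right value, not something I have verified on my cluster:

```
# slurm.conf (fragment)
# MpiDefault selects the plugin srun uses when --mpi is not given;
# pmi2 is an assumption here, to match an MPICH built with Slurm's PMI2 support.
MpiDefault=pmi2
```

Alternatively, I believe the plugin can be forced per job with `srun --mpi=pmi2 ...`, and the supported plugins listed with `srun --mpi=list`.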
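P.S. To convince myself of what the "Connection refused" in the slurmd log means, I wrote the small probe below (my own throwaway script, not part of Slurm; the socket path is the one from my log). It distinguishes "the socket file was never created" from "the file exists but nothing is accepting on it", which is what ECONNREFUSED on a Unix-domain socket indicates:

```python
import os
import socket

def probe(path):
    """Return a one-word diagnosis of a Unix-domain socket path."""
    if not os.path.exists(path):
        return "missing"      # the socket file was never created
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return "listening"    # something is accepting on the socket
    except ConnectionRefusedError:
        # File exists, but no process is listening -- the same
        # ECONNREFUSED that the slurmd log reports.
        return "refused"
    finally:
        s.close()

if __name__ == "__main__":
    # Socket path taken from my slurmd log (sock.pmi2.<jobid>.<stepid>)
    print(probe("/var/spool/slurmd/sock.pmi2.2.0"))
```

(On a machine where the socket file does not exist it simply prints "missing".)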