Good afternoon,

A simulation package I'm using calls MPI_Comm_spawn to dynamically spawn simulation processes. This works fine interactively on a head node, but I run into problems when I submit the job to the SLURM scheduler using sbatch. If I use salloc instead, it does work. I would like to find out whether I can also get this to work with sbatch, or whether this is a known limitation.

I am using the Intel compilers and Intel MPI, version 2020 Update 1. The problem can be reproduced with a small program, the source code of which is given at the end of this message.

This works (when $PWD is in $PATH):

salloc -N 2 mpiexec -np 1 spawn_example

This does not work: an sbatch job with a job file containing:

#================
#!/bin/bash
#SBATCH -N 2
#SBATCH --job-name=P12345.678
source /etc/profile.d/modules.sh
export I_MPI_DEBUG=500
mpiexec -np 1 ./spawn_example
#================

Further details:

If I attach a gdb session to the process that performs the spawning, I get the backtrace below. It looks like the process hangs in MPI_Comm_spawn while trying to read something.

#0 0x00001555537ce8b2 in read () from /lib64/libpthread.so.0
#1 0x00001555544fd0ef in read (__fd=<optimized out>, __buf=<optimized out>, __nbytes=<optimized out>) at /usr/include/bits/unistd.h:45
#2 PMIU_readline (fd=5, buf=0x155554eedca0 "cmd=get_result rc=0 msg=success value=mpi#0200CD280A950027", '0' <repeats 16 times>, "$\n", maxlen=1023) at ../../src/pmi/simple/simple_pmiutil.c:143
#3 0x00001555544facde in VPMI_Spawn_multiple (count=5, cmds=0x155554eedca0, argvs=0x3ff, maxprocs=0x1555537ce8b2 <read+18>, info_keyval_sizes=0x0, info_keyval_vectors=0x28, preput_keyval_size=1, preput_keyval_vector=0x7f20000a0b40, errors=0x7f20001f1840) at ../../src/pmi/simple/simple_pmi.c:951
#4 0x00001555543a5f69 in MPIR_pmi_spawn_multiple (count=5, commands=0x155554eedca0, argvs=0x3ff, maxprocs=0x1555537ce8b2 <read+18>, info_ptrs=0x0, num_preput_keyval=40, preput_keyvals=0x28, pmi_errcodes=0x1) at ../../src/util/mpir_pmi.c:1417
#5 0x0000155553fdc85d in MPID_Comm_spawn_multiple (count=5, commands=0x155554eedca0, argvs=0x3ff, maxprocs=0x1555537ce8b2 <read+18>, info_ptrs=0x0, root=40, comm_ptr=0x0, intercomm=0x155554022e32 <PMPI_Comm_spawn+946>, errcodes=0x155554f52b60 <MPIR_Comm_builtin>) at ../../src/mpid/ch4/src/ch4_spawn.c:85
#6 0x0000155554022e32 in PMPI_Comm_spawn (command=0x7ffffffface0 "", argv=0x7fffffffadd4, maxprocs=0, info=1400694962, root=0, comm=40, intercomm=0x0, array_of_errcodes=0x7fffffffafd0) at ../../src/mpi/spawn/comm_spawn.c:118
#7 0x0000000000400c88 in main (argc=1, argv=0x7fffffffafd8, envp=0x7fffffffafe8) at spawn_example.c:53

================= source code =====================
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NUM_SPAWNS 56

int main(int argc, char* argv[])
{
    int np = NUM_SPAWNS;
    int errorcodes[NUM_SPAWNS];
    MPI_Comm parentcomm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parentcomm);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (MPI_COMM_NULL == parentcomm) {
        /* No parent communicator: this is the original process, so it spawns the children. */
        MPI_Comm_spawn("spawn_example", MPI_ARGV_NULL, np, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, errorcodes);
        printf("I'm the parent with rank=%d\n", rank);
    } else {
        /* A parent communicator exists: this process was started by MPI_Comm_spawn. */
        printf("I'm spawned with rank %d\n", rank);
    }
    fflush(stdout);

    MPI_Finalize();
    return 0;
}

Thank you for your insight,

All the best,

Menno Deij - van Rijswijk

dr. ir. Menno A. Deij-van Rijswijk | Researcher | Research & Development
MARIN | T +31 317 49 35 06 | mailto:m.d...@marin.nl | http://www.marin.nl