On 08/10/20 08:48, Chris Samuel wrote:

Sorry for being so late. I've had to wait for the node to be free.
> Launch it with "srun" rather than "mpirun", that way it'll be managed by
> Slurm. If your test program then says every rank is rank 0 that will
> tell you OpenMPI is not built with Slurm support.

Seems so:

"The application appears to have been direct launched using "srun", but
OMPI was not built with SLURM's PMI support and therefore cannot execute."

So it seems I can't use srun to launch OpenMPI jobs. But with just
s/srun/mpirun (which, IIUC, should be supported) it seems to work, and it
even auto-detects the correct number of ranks to use.

I launched the test executable with mpirun on one of the newer nodes (56
threads) and got:

-8<--
[...]
Hello from task 52 on str957-mtx-11!
Hello from task 53 on str957-mtx-11!
Hello from task 54 on str957-mtx-11!
This is an MPI parallel code for Hello World with no communication
Hello from task 0 on str957-mtx-11!
MASTER: Number of MPI tasks is: 56
Hello from task 18 on str957-mtx-11!
[...]
-8<--

But if I run it on the older 32-thread node:

-8<--
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff480b700 (LWP 19633)]
[New Thread 0x7ffff3fe9700 (LWP 19634)]
[New Thread 0x7ffff3764700 (LWP 19635)]
[New Thread 0x7ffff2f63700 (LWP 19636)]
[Detaching after fork from child process 19637]
[Detaching after fork from child process 19638]
[Detaching after fork from child process 19639]
[Detaching after fork from child process 19641]
[Detaching after fork from child process 19643]
[Detaching after fork from child process 19645]
[Detaching after fork from child process 19647]
[Detaching after fork from child process 19649]
[Detaching after fork from child process 19651]
[Detaching after fork from child process 19653]
[Detaching after fork from child process 19655]
[Detaching after fork from child process 19657]
[Detaching after fork from child process 19659]
[Detaching after fork from child process 19661]
[Detaching after fork from child process 19663]
[Detaching after fork from child process 19665]
[Detaching after fork from child process 19667]
[Detaching after fork from child process 19669]
[Detaching after fork from child process 19671]
[Detaching after fork from child process 19673]
[Detaching after fork from child process 19675]
[Detaching after fork from child process 19677]
[Detaching after fork from child process 19679]
[Detaching after fork from child process 19681]
[Detaching after fork from child process 19683]
[Detaching after fork from child process 19685]
[Detaching after fork from child process 19687]
[Detaching after fork from child process 19689]
[Detaching after fork from child process 19691]
[Detaching after fork from child process 19693]
[Detaching after fork from child process 19695]
[Detaching after fork from child process 19697]
[str957-bl0-03:19637] *** Process received signal ***
[str957-bl0-03:19637] Signal: Segmentation fault (11)
[str957-bl0-03:19637] Signal code: Address not mapped (1)
[str957-bl0-03:19637] Failing at address: 0x7ffff7fac008
[str957-bl0-03:19637] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7ffff7e92730]
[str957-bl0-03:19637] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7ffff646d936]
[str957-bl0-03:19637] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7ffff6444733]
[str957-bl0-03:19637] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7ffff646d5b4]
[str957-bl0-03:19637] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7ffff659346e]
[str957-bl0-03:19637] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7ffff654b88d]
[str957-bl0-03:19637] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7ffff6507d7c]
[str957-bl0-03:19637] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7ffff6603fe4]
[str957-bl0-03:19637] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7ffff7fb1656]
[str957-bl0-03:19637] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7ffff7c1c11a]
[str957-bl0-03:19637] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7ffff7eece62]
[str957-bl0-03:19637] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7ffff7f1b17e]
[str957-bl0-03:19637] [12] ./mpitest-debug(+0x11c6)[0x5555555551c6]
[str957-bl0-03:19637] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7ffff7ce309b]
[str957-bl0-03:19637] [14] ./mpitest-debug(+0x10da)[0x5555555550da]
[str957-bl0-03:19637] *** End of error message ***
[... repeats the same error another 29 times ...]
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit
code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[... repeats the error another 2 times ...]
[Thread 0x7ffff480b700 (LWP 19633) exited]
[Thread 0x7ffff3fe9700 (LWP 19634) exited]
[Thread 0x7ffff2f63700 (LWP 19636) exited]
[Thread 0x7ffff3764700 (LWP 19635) exited]
[Inferior 1 (process 19626) exited with code 0213]
No stack.
No stack.
-8<--

Some of the extra messages are from gdb. The job step line in the script is:

gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' \
    -ex 'thread apply all bt full' --args srun ./mpitest-debug

The code is compiled with debug support. I'm quite lost...

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
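[Editor's note: the PMI mismatch reported above ("OMPI was not built with
SLURM's PMI support") can be checked directly. A minimal sketch, assuming
`srun` and `ompi_info` are in PATH on a cluster node; `srun --mpi=list`
and `ompi_info` are standard commands, the wrapper function is only a
hypothetical convenience so the check degrades gracefully off-cluster:]

```shell
# Sketch: compare the PMI flavours srun can offer with the PMI/Slurm
# components this OpenMPI build actually contains. If the two sets don't
# overlap, direct launch via srun cannot work.
check_pmi_support() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not in PATH; run this on a cluster node"
        return 0
    fi
    # PMI flavours srun can hand to directly-launched tasks:
    command -v srun >/dev/null 2>&1 && srun --mpi=list
    # PMI/Slurm components this OpenMPI was built with:
    ompi_info 2>/dev/null | grep -i -e pmi -e slurm \
        || echo "no PMI/Slurm components: direct launch via srun will fail"
}

check_pmi_support
```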
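[Editor's note: the gdb wrapper above debugs srun itself, and gdb then
detaches from every forked child (hence the "Detaching after fork" lines
and the final "No stack." — the backtrace shown was printed by OpenMPI's
own signal handler, not by gdb). A sketch that instead puts gdb inside
the launcher so each rank runs under its own gdb; `-np 2` and the wrapper
function name are placeholders, the mpirun/gdb flags are standard:]

```shell
# Sketch: debug each MPI rank rather than the launcher. Every rank gets
# its own batch-mode gdb, so the segfaulting rank's stack is captured.
debug_each_rank() {
    command -v mpirun >/dev/null 2>&1 || { echo "mpirun not in PATH"; return 0; }
    [ -x ./mpitest-debug ] || { echo "mpitest-debug not found here"; return 0; }
    mpirun -np 2 gdb -batch -n \
        -ex 'set pagination off' -ex run -ex bt -ex 'bt full' \
        --args ./mpitest-debug
}

debug_each_rank
```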