Hello all.
I've been trying to debug this problem for some weeks now, but it seems I'm still missing something.

I have a small, (very) heterogeneous cluster. After upgrading to Debian 10 and the packaged versions of Slurm and the IB drivers/tools, I noticed that *sometimes* jobs requesting 32 or more tasks fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--

Just changing --ntasks=32 to --ntasks=30 (or less) lets the job run without problems. *Sometimes* it even works with --ntasks=32.

But the most absurd thing I've seen is this (changing only the mpirun step in the batch job):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' -ex 'thread apply all bt full' --args mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--
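For reference, mpitest is roughly equivalent to the minimal sketch below (this is just an illustration; the real test may do a bit more, but as the backtrace shows, the crash already happens inside MPI_Init_thread / PMIx_Init, before any actual work):
-8<--
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Sketch of a minimal MPI test; the actual mpitest may differ.
       The segfault above occurs during MPI_Init_thread (which ends up
       in PMIx_Init), so nothing beyond initialization is needed to
       hit the failing code path. */
    int provided, rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
-8<--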
In the end, *the same* command that had consistently failed started to run. The currently problematic node is one without InfiniBand, so IB can probably be ruled out.

Any hints? TIA.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786