Diego, I'm *guessing* that you are tripping over the use of "--ntasks=32" on a heterogeneous cluster, though your comment about the node without InfiniBand troubles me. If you drain that node, or exclude it on your command line, that might correct the problem (example commands below). I wonder if OMPI and PMIx have decided that IB is the way to go, and are failing when they try to set up on the node without IB.
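For example (node and script names are placeholders, adjust to your site):

-8<--
# Drain the suspect node so no new jobs land on it (Reason is free text):
scontrol update NodeName=<non-IB-node> State=DRAIN Reason="PMIx segfault debugging"

# Or leave it online but keep this particular job off it:
sbatch --exclude=<non-IB-node> myjob.sh
# (equivalently, inside the batch script:  #SBATCH --exclude=<non-IB-node>)
-8<--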
If that's not it, I'd try:

0. Checking sacct for the node lists of the successful and unsuccessful runs -- a problem node might jump out.
1. Running your job with explicit node lists. Again, you may find a problem node this way.

(Example commands for both are at the very end of this message, below Diego's note.)

HTH!
Andy

p.s. If this doesn't fix it, please include the Slurm and OMPI versions, and a copy of your slurm.conf file (with identifying information like node names removed) in your next note to this list.

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Diego Zuccato
Sent: Friday, June 5, 2020 9:08 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Intermittent problem at 32 CPUs

Hello all.
I have already spent some weeks trying to debug this problem, but it seems I'm still missing something.
I have a small, (very) heterogeneous cluster. After upgrading to Debian 10 and to the packaged versions of Slurm and the IB drivers/tools, I noticed that *sometimes* jobs requesting 32 or more threads fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--
Just changing --ntasks=32 to --ntasks=30 (or fewer) lets it run w/o problems. *Sometimes* it works even with --ntasks=32.
But the most absurd thing I've seen is this (just changing the step in the batch job):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' -ex 'thread apply all bt full' --args mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--
In the end, *the same* command that had consistently failed started to run.
The currently problematic node is one w/o InfiniBand, so IB itself can probably be ruled out.
Any hints?
TIA.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
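[Andy's example commands for the two checks suggested above; job ID, node names, and script name are placeholders:]

-8<--
# 0. Compare the node lists of good and bad runs:
sacct -j <jobid> --format=JobID,JobName,NodeList,State,ExitCode

# 1. Pin the job to an explicit node list and vary it between runs:
sbatch --nodelist=<node1>,<node2> myjob.sh
# (or interactively:  srun --nodelist=<node1>,<node2> --ntasks=32 ./mpitest)
-8<--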