Diego,

I'm *guessing* that you are tripping over the use of "--ntasks=32" on a 
heterogeneous cluster, though your comment about the node without InfiniBand 
troubles me. If you drain that node, or exclude it on your command line, that 
might correct the problem. I wonder if OMPI and PMIx have decided that IB is 
the way to go, and are failing when they try to set up on the node without IB.
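Roughly what I mean (the node name and script name below are placeholders, 
not your real ones):
-8<--
# Take the suspect node out of service until it's sorted out:
scontrol update NodeName=<nodename> State=DRAIN Reason="PMIx segfault debugging"

# ...or just keep it out of a single submission:
sbatch --exclude=<nodename> myjob.sh
-8<--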

If that's not it, I'd try:
0. Checking sacct for the node lists of the successful and unsuccessful runs -- a 
problem node might jump out.
1. Running your job with explicit node lists. Again, you may find a problem 
node this way (rough examples of both below).
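Something along these lines (job ID, node names, and script name are 
placeholders):
-8<--
# 0. Which nodes did a given job actually land on, and how did it exit?
sacct -j <jobid> --format=JobID,NodeList,State,ExitCode

# 1. Pin the job to an explicit set of nodes to bisect the problem:
sbatch --nodelist=<node1>,<node2> myjob.sh
-8<--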

HTH!
Andy

p.s. If this doesn't fix it, please include the Slurm and OMPI versions, and a 
copy of your slurm.conf file (with identifying information like node names 
removed) in your next note to this list.
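A quick way to grab the versions on Debian, assuming the stock packages you 
mentioned:
-8<--
sinfo --version                          # Slurm version
mpirun --version                         # Open MPI version
dpkg -l | grep -Ei 'slurm|openmpi|pmix'  # exact package versions
-8<--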

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Friday, June 5, 2020 9:08 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Intermittent problem at 32 CPUs

Hello all.

I've been trying for some weeks to debug this problem, but it seems I'm
still missing something.
I have a small, (very) heterogeneous cluster. After upgrading to Debian
10 and the packaged versions of Slurm and the IB drivers/tools, I noticed
that *sometimes* jobs requesting 32 or more tasks fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--
Just changing --ntasks=32 to --ntasks=30 (or less) lets it run w/o problems.
*Sometimes* it works even with --ntasks=32.
But the most absurd thing I've seen is this (changing only the job step in
the batch script):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' -ex 'thread apply all bt full' --args mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--

In the end, *the same* command that had consistently failed started to run.
The currently problematic node is one w/o InfiniBand, so InfiniBand
itself can probably be ruled out.
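For context, the batch script is essentially just this (job name, partition
and paths are placeholders here, not the real ones):
-8<--
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --partition=<partition>
#SBATCH --ntasks=32

mpirun ./mpitest
-8<--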

Any hints?

TIA.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

