Package: libopenmpi3
Version: 4.1.2~rc1-4
Severity: important
Control: affects -1 src:fenics-dolfinx

fenics-dolfinx FTBFS on the 32-bit arches i386, armel and armhf; see
https://buildd.debian.org/status/package.php?p=fenics-dolfinx&suite=experimental
https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-3&stamp=1633022115&raw=0

(Ignore the 2021-10-02 failed builds using libhdf5-mpi-dev
1.10.7+repack-2; those need libcurl-dev, i.e. libcurl4-openssl-dev, see
Bug#995594.)

The symptom in the dolfinx build logs is
  signal number 11 SEGV: Segmentation Violation, probably memory access out of range
when running demo_poisson_mpi_2.

PETSc suggests running with -start_in_debugger. When I do that on the
i386 porterbox and run demo_poisson manually with 2 processes, it gives
a more detailed backtrace:

(experimental_i386-dchroot)barriere$ mpiexec -n 2 ./demo_poisson -start_in_debugger
PETSC: Attaching gdb to ./demo_poisson of pid 5638 on display :0.0 on machine barriere
PETSC: Attaching gdb to ./demo_poisson of pid 5639 on display :0.0 on machine barriere
Unable to start debugger in xterm: No such file or directory
Unable to start debugger in xterm: No such file or directory
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05638] *** Process received signal ***
[barriere:05638] Signal: Aborted (6)
[barriere:05638] Signal code:  (-6)
[barriere:05638] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f32090]
[barriere:05638] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f32069]
[barriere:05638] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f00f36]
[barriere:05638] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5ee9312]
[barriere:05638] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf653dd26]
[barriere:05638] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf653a3b0]
[barriere:05638] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf653e790]
[barriere:05638] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf653e979]
[barriere:05638] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f32080]
[barriere:05638] [ 9] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xd7f)[0xf7ead68f]
[barriere:05638] [10] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ebeb9d]
[barriere:05638] [11] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ebece9]
[barriere:05638] [12] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7dff8b5]
[barriere:05638] [13] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7e96d63]
[barriere:05638] [14] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fdab)[0xf7dfedab]
[barriere:05638] [15] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7dff7c8]
[barriere:05638] [16] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7dff83f]
[barriere:05638] [17] ./demo_poisson(+0x19953)[0x565ed953]
[barriere:05638] [18] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5eeafd6]
[barriere:05638] [19] ./demo_poisson(+0x18451)[0x565ec451]
[barriere:05638] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05639] *** Process received signal ***
[barriere:05639] Signal: Aborted (6)
[barriere:05639] Signal code:  (-6)
[barriere:05639] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f66090]
[barriere:05639] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f66069]
[barriere:05639] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f34f36]
[barriere:05639] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5f1d312]
[barriere:05639] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf6571d26]
[barriere:05639] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf656e3b0]
[barriere:05639] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf6572790]
[barriere:05639] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf6572979]
[barriere:05639] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f66080]
[barriere:05639] [ 9] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x4cd6)[0xf1354cd6]
[barriere:05639] [10] /usr/lib/i386-linux-gnu/libopen-pal.so.40(opal_progress+0x30)[0xf4dcde70]
[barriere:05639] [11] /usr/lib/i386-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0xf4dd4a5d]
[barriere:05639] [12] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x236)[0xf7a1b2c6]
[barriere:05639] [13] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbb)[0xf7a73b2b]
[barriere:05639] [14] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0xf7)[0xf7a77b67]
[barriere:05639] [15] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_do_this+0x11d)[0xf11b28ed]
[barriere:05639] [16] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x99)[0xf11adca9]
[barriere:05639] [17] /usr/lib/i386-linux-gnu/libmpi.so.40(MPI_Alltoall+0x182)[0xf7a2f6d2]
[barriere:05639] [18] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx3MPI10all_to_allIxEENS_5graph13AdjacencyListIT_EEP19ompi_communicator_tRKS5_+0x15d)[0xf7eb95ed]
[barriere:05639] [19] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xdd6)[0xf7ee16e6]
[barriere:05639] [20] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ef2b9d]
[barriere:05639] [21] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ef2ce9]
[barriere:05639] [22] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7e338b5]
[barriere:05639] [23] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7ecad63]
[barriere:05639] [24] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fbb2)[0xf7e32bb2]
[barriere:05639] [25] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7e337c8]
[barriere:05639] [26] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7e3383f]
[barriere:05639] [27] ./demo_poisson(+0x19953)[0x5658d953]
[barriere:05639] [28] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5f1efd6]
[barriere:05639] [29] ./demo_poisson(+0x18451)[0x5658c451]
[barriere:05639] *** End of error message ***


Since there are two processes, there are two backtraces here. The
first (pid 5638) runs through libdolfinx_real.so and reaches
dolfinx::mesh::build_dual_graph, which takes an MPI communicator
(ompi_communicator_t*) and a dolfinx graph::AdjacencyList as arguments,
before the segfault is reached.
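The frame names in the log are C++-mangled. A quick way to read them is c++filt from binutils; this sketch assumes c++filt is installed and simply pipes in the build_dual_graph symbol copied from frame [ 9] of the first backtrace:

```shell
# Decode the mangled symbol from frame [ 9] of the first backtrace (pid 5638).
# Assumes c++filt (from binutils) is on PATH.
echo '_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi' | c++filt
# prints: dolfinx::mesh::build_dual_graph(ompi_communicator_t*, dolfinx::graph::AdjacencyList<long long> const&, int)
```

The `x` in `AdjacencyListIxE` is `long long`, which is what `std::int64_t` maps to on i386, so the adjacency list holds 64-bit values even on this 32-bit arch.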

The second process (pid 5639) is in dolfinx::MPI::all_to_all,
exchanging a graph::AdjacencyList through MPI_Alltoall. From
libdolfinx_real.so it goes to libmpi.so, through mca_coll_tuned.so and
libopen-pal.so, and finally reaches mca_btl_vader.so before hitting the
segfault.
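The mangled dolfinx frame [18], directly above MPI_Alltoall in the second backtrace, can be decoded the same way with c++filt from binutils (again assuming it is installed):

```shell
# Decode the mangled symbol from frame [18] of the second backtrace (pid 5639).
# Assumes c++filt (from binutils) is on PATH.
echo '_ZN7dolfinx3MPI10all_to_allIxEENS_5graph13AdjacencyListIT_EEP19ompi_communicator_tRKS5_' | c++filt
# prints: dolfinx::graph::AdjacencyList<long long> dolfinx::MPI::all_to_all<long long>(ompi_communicator_t*, dolfinx::graph::AdjacencyList<long long> const&)
```

So this rank was exchanging an adjacency list of `long long` (64-bit) values via MPI_Alltoall, which the trace shows passing through mca_coll_tuned.so and opal_progress into the vader shared-memory BTL.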

So if I'm reading the traces correctly, the segfault is triggered
inside mca_btl_vader.so.

This would effectively be Severity: serious (FTBFS), except that the
new fenics-dolfinx is only in experimental, and the older version of
dolfinx in unstable doesn't crash this way. I'm hoping to push the new
version of dolfinx to unstable soon, though.
