Package: nwchem
Version: 7.0.2-1
Severity: important
Control: forwarded -1 https://github.com/pmodels/armci-mpi/issues/33
Control: affects -1 libarmci-mpi-dev openmpi-bin

The Debian testing build of nwchem currently fails to run across
multiple nodes. It runs fine on a single node.

The nodes form a cluster managed by OpenStack, with 16 CPUs per node.

Testing against the sample water script at
https://nwchemgit.github.io/Sample.html, one node runs successfully with

  mpirun -n 16 nwchem water.nw

I can also run successfully on a different (single) node (here
launching from node-1 to execute on node-2):

  mpirun -H node-2:16 -n 16 nwchem water.nw

The segfault occurs when I try to run on both nodes. Whether with
-n 32 or -N 16,

  mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
or
  mpirun -H node-1:16,node-2:16 -N 32 nwchem water.nw

both fail the same way. The error message is:

$ mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31]  10 - nwchem(+0x2836605) [0x55fe1ee26605]
[31]   9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]
[31]   8 - nwchem(+0x282c358) [0x55fe1ee1c358]
[31]   7 - nwchem(+0x2819f68) [0x55fe1ee09f68]
[31]   6 - nwchem(+0x2819cba) [0x55fe1ee09cba]
[31]   5 - nwchem(+0x2819d76) [0x55fe1ee09d76]
[31]   4 - nwchem(+0x2818fe9) [0x55fe1ee08fe9]
[31]   3 - nwchem(+0x11b79) [0x55fe1c601b79]
[31]   2 - nwchem(+0x12659) [0x55fe1c602659]
[31]   1 - /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xcd) [0x7fb2c8ffa7ed]
[31]   0 - nwchem(+0x1069a) [0x55fe1c60069a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 31 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node-1
  Local PID:  1264980
  Peer host:  node-2
--------------------------------------------------------------------------

I've tried a fresh rebuild of armci-mpi, ga and nwchem, but the
segfault persists.

I've tried running with ARMCI_USE_WIN_ALLOCATE=0 as suggested in the
armci-mpi README, but it doesn't avoid the segfault.

After rebuilding against mpich (rebuilding armci-mpi and ga), an mpich
build of nwchem runs fine. That suggests the problem lies in how
openmpi works with armci.

I'm inclined to work around the problem by just proceeding with mpich
builds of nwchem. It's only two packages deep (armci-mpi and ga), and
in practice they both belong to nwchem anyway, so it wouldn't be too
disruptive.

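The backtrace in the error message has no symbols, only offsets into
the nwchem binary. As a rough sketch (not something that was run for
this report), those offsets could be extracted from a saved copy of
the log and handed to addr2line. The function names, the NWCHEM_BIN
variable, the default path /usr/bin/nwchem and the availability of
debug symbols (e.g. a dbgsym package) are all assumptions made for
illustration.

```shell
#!/bin/sh
# Sketch only: helper functions for symbolising an ARMCI backtrace.
# Assumes the mpirun output was saved to a file, one frame per line.

# Print the hex offset of every "nwchem(+0x...)" frame, one per line,
# in the order the frames appear. Frames from other objects (libc) are
# skipped because they do not match the pattern.
extract_offsets() {
    sed -n 's/.*nwchem(+\(0x[0-9a-fA-F]*\)).*/\1/p' "$1"
}

# Resolve the offsets to function names and source lines.
# Assumption: NWCHEM_BIN (hypothetical variable) points at the same
# binary that produced the backtrace, and debug symbols are installed.
resolve_offsets() {
    extract_offsets "$1" |
        xargs addr2line -f -C -e "${NWCHEM_BIN:-/usr/bin/nwchem}"
}
```

With the output saved as, say, mpirun.log, calling
resolve_offsets mpirun.log would print a function name and a file:line
pair per frame, or ?? where no debug information is available.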
-- System Information:
Debian Release: bookworm/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.16.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages nwchem depends on:
ii  libatlas3-base [liblapack.so.3]        3.10.3-12
ii  libblas3 [libblas.so.3]                3.10.0-2
ii  libblis3-openmp [libblas.so.3]         0.8.1-2
ii  libblis3-pthread [libblas.so.3]        0.8.1-2
ii  libc6                                  2.33-6
ii  libgcc-s1                              11.2.0-16
ii  libgfortran5                           11.2.0-16
ii  liblapack3 [liblapack.so.3]            3.10.0-2
ii  libopenblas0-openmp [liblapack.so.3]   0.3.19+ds-3
ii  libopenblas0-pthread [liblapack.so.3]  0.3.19+ds-3
ii  libopenmpi3                            4.1.2-1
ii  libpython3.9                           3.9.10-1
ii  libscalapack-openmpi2.1                2.1.0-4
ii  mpi-default-bin                        1.14
ii  nwchem-data                            7.0.2-2

nwchem recommends no packages.

nwchem suggests no packages.

-- no debconf information