Your message dated Wed, 16 Apr 2025 12:05:13 +0100
with message-id <d91d2a47-69d2-4c35-8924-dbcf9ba09...@mckinstry.ie>
and subject line Re: Bug#1102068: libfabric: FTBFS on 32-bit arches: ofi_cma.h:
error: passing argument 2 of 'ofi_consume_iov' from incompatible pointer type
has caused the Debian Bug report #1102612,
regarding mpich 4.3 not initialising multiple processes
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)
--
1102612: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1102612
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
--- Begin Message ---
Package: mpich
Version: 4.3.0-5
Severity: serious
Justification: debci
I apologise for another serious bug, but mpich 4.3 is doing weird
things that we don't want in trixie. I see the problem in mpich test
errors in armci-mpi
(https://buildd.debian.org/status/fetch.php?pkg=armci-mpi&arch=amd64&ver=0.4-5&stamp=1744327219&raw=0)
but can reproduce in a trivial test.
The problem is that mpich is not initialising multiple processes as one
job. Instead it simply launches multiple independent single processes
(each with MPI_Comm_size = 1).
You can see the problem in the armci-mpi test errors, e.g.
FAIL: benchmarks/ping-pong
==========================
[1744327153.607644] [sbuild:19884:0] sock.c:513 UCX WARN unable
to read somaxconn value from /proc/sys/net/core/somaxconn file
[0] ARMCI Error: This benchmark should be run on at least two processes
Abort(1) on node 0 (rank 0 in comm 496): application called
MPI_Abort(comm=0x84000000, 1) - process 0
[0] ARMCI Error: This benchmark should be run on at least two processes
[1744327153.612861] [sbuild:19883:0] sock.c:513 UCX WARN unable
to read somaxconn value from /proc/sys/net/core/somaxconn file
Abort(1) on node 0 (rank 0 in comm 496): application called
MPI_Abort(comm=0x84000000, 1) - process 0
FAIL benchmarks/ping-pong (exit status: 1)
The error message "at least two processes" is issued by armci-mpi's ping-pong.c
when it detects MPI_Comm_size = 1. But the test is launched with
mpiexec.mpich -np 2, which is why the error appears twice.
I can reproduce the issue with a trivial test:
```
$ cat mpich_test.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int me, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    printf("mpi test rank %d of %d\n", me, nproc);
    MPI_Finalize();
    return 0;
}
$ mpicc.mpich -o mpich_test mpich_test.c
$ mpiexec.mpich -n 4 ./mpich_test
mpi test rank 0 of 1
mpi test rank 0 of 1
mpi test rank 0 of 1
mpi test rank 0 of 1
```
It should instead be reporting
mpi test rank 3 of 4
mpi test rank 1 of 4
mpi test rank 0 of 4
mpi test rank 2 of 4
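For anyone wanting to check for this symptom automatically (e.g. in an autopkgtest), here is a minimal sketch in plain C that scans captured mpiexec output for the "mpi test rank R of N" lines printed by the trivial test above and counts the lines whose reported size differs from the requested one. The function name count_size_mismatches and its return convention are hypothetical, chosen only for illustration; it is not part of mpich or armci-mpi.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not mpich API):
 * scans captured output for lines of the form "mpi test rank R of N"
 * and counts how many report a size N different from expected_nproc.
 * Returns the number of mismatching lines, or -1 if no such line was
 * found at all (e.g. the program never got as far as printing). */
int count_size_mismatches(const char *output, int expected_nproc)
{
    int matched = 0, mismatches = 0;
    const char *line = output;

    while (line && *line) {
        int rank, nproc;
        /* Lines that do not start with the expected text (such as the
         * pmix warnings) simply fail to match and are skipped. */
        if (sscanf(line, "mpi test rank %d of %d", &rank, &nproc) == 2) {
            matched++;
            if (nproc != expected_nproc)
                mismatches++;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return matched ? mismatches : -1;
}
```

Fed the faulty output shown above with expected_nproc = 4, every line counts as a mismatch; fed the expected output, none do. The order in which ranks print is not checked, since that is not deterministic.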
There is even more weirdness, however. The first time I compiled and
ran this trivial test, it did report having 4 processes, but that
correct output was accompanied by pmix warnings:
Query for unrecognized attribute: pmix.qry.node
Query for unrecognized attribute: pmix.qry.peers
After recompiling the same way, it no longer gave the correct output,
but it also no longer gave the pmix warnings.
Can you reproduce this problem?
-- System Information:
Debian Release: trixie/sid
APT prefers unstable-debug
APT policy: (500, 'unstable-debug'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 6.12.21-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8),
LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages mpich depends on:
ii hwloc 2.12.0-1
ii libc6 2.41-6
ii libhwloc15 2.12.0-1
ii libmpich12 4.3.0-5
ii libslurm42t64 24.11.3-2
ii perl 5.40.1-2
Versions of packages mpich recommends:
ii libmpich-dev 4.3.0-5
Versions of packages mpich suggests:
ii mpich-doc 4.3.0-5
-- no debconf information
--- End Message ---
--- Begin Message ---
Apologies, this was supposed to be #1102612.
Closing that, as PMIX is now disabled; hydra in mpich (the mpiexec
daemon) does not currently support pmix
(which was added to mpich recently).
Regards
Alastair
On 16/04/2025 09:54, Drew Parsons wrote:
Source: libfabric
Version: 2.1.0-1
Followup-For: Bug #1102068
Control: tags -1 ftbfs
Control: reopen -1
I think that bug closed by mpich 4.3.0-6 was meant to be one of the
other mpich bugs (#1102612).
32-bit arches are still failing to build libfabric 2.1.0-1 the same way,
so reopening this bug.
--
Alastair McKinstry,
GPG: 82383CE9165B347C787081A2CBE6BB4E5D9AD3A5
e: alast...@mckinstry.ie, im: @alastair:mckinstry.ie @amckins...@mastodon.ie
Commander Vimes didn’t like the phrase “The innocent have nothing to fear,”
believing the innocent had everything to fear, mostly from the guilty but in
the longer term even more from those who say things like “The innocent have
nothing to fear.”
- T. Pratchett, Snuff
--- End Message ---