Control: reassign -1 src:gcc-14
Control: affects -1 = src:gcc-13 sbuild
Control: tags -1 + ftbfs patch upstream
Control: forwarded -1 https://github.com/dlang/phobos/pull/10586
Control: severity -1 serious

Hi Matthias, Emanuele and Jochen,

On Wed, Dec 04, 2024 at 08:05:58AM +0100, Emanuele Rocca wrote:
> Starting with version 0.86.0, sbuild using the unshare backend does not
> build gcc packages successfully, at least on arm64 and amd64. The
> version immediately before that, 0.85.11, works. Similar failures have
> been seen on other architectures, I think. Both gcc-13 and gcc-14 are
> affected, and probably others as well.
> 
> The following logs show sbuild 0.85.11 successfully building gcc-14:
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T13:36:47Z.build
> 
> The issue occurs when running tests. To try and reduce the search space
> in terms of where the problem may be, as well as reduce build times a
> little bit, I've tried to single out one language for which the build
> fails. Such language seems to be D.
> 
>  DEB_BUILD_OPTIONS=nolang=ada,go,c,c++,fortran,objc,obj-c++,m2,rust
> 
> The above DEB_BUILD_OPTIONS results in gcc being built with:
> 
>  --enable-languages=c,d
> 
> This is 0.85.0 building gcc-14 correctly. I haven't tried 0.85.11, but
> it's very likely going to work as well.
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T20:26:06Z.build
> 
> The following logs show sbuild 0.86.0 failing to build gcc-14, all other
> things being equal (including --enable-languages=c,d).
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T18:58:58Z.build

This is already a lot of useful clues. Before I started looking into
this, Emanuele and Jochen had already figured out that replacing
sbuild's init written in perl with dumb-init (as it was using earlier)
made it work, so that's the interesting change.

Today the three of us met virtually and further debugged the issue.
Eventually, Emanuele obtained the relevant process.exe test case from
the gcc-14 build and that really sped up further debugging. He managed
to produce a full strace of running it inside sbuild-usernsexec and a
variant of it patched to revert to dumb-init thanks to Jochen. It failed
in the former and worked in the latter.

Studying those straces is like searching for a needle in a haystack, but
eventually we found a difference. Both of them were issuing:

    kill(-2, SIGTERM)

In the succeeding test, this syscall would return -ESRCH. In the failing
one it would succeed. The interesting part is what is being killed here.
A negative pid below -1 identifies a process group. When running
dumb-init, there would be a few forks before launching
dpkg-buildpackage, so the process group id of dpkg-buildpackage would
end up being 16. In the perl implementation, those forks were elided, so
dpkg-buildpackage was running as process group id 2. Due to the use of a
pid namespace, it would reliably end up being 2. So what is being killed
here is the entire build.
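
As a small illustration of how the pid argument of kill(2) is
interpreted, here is a sketch in Python. The helper name kill_target is
made up for this mail; it only classifies the argument according to the
POSIX rules and does not send any signal.

```python
from typing import Optional, Tuple


def kill_target(pid: int) -> Tuple[str, Optional[int]]:
    """Describe what kill(pid, sig) would address (POSIX kill semantics).

    Hypothetical helper for illustration; it never sends a signal.
    """
    if pid > 0:
        # A positive value names exactly one process.
        return ("process", pid)
    if pid == 0:
        # Zero means the process group of the caller.
        return ("caller's process group", None)
    if pid == -1:
        # -1 means every process the caller is allowed to signal.
        return ("every process the caller may signal", None)
    # Anything below -1 names the process group with id abs(pid).
    return ("process group", -pid)


if __name__ == "__main__":
    print(kill_target(-2))
    print(kill_target(16))
```

So kill(-2, SIGTERM) addresses process group 2, which only exists when
something happens to be running in it.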

Further digging into std/process.d revealed that the Pid class labels
the constant -2 as "terminated". The tryWait function changes the
processID value from the original value to -2 and the subsequent call to
kill then receives it. It seems that few people build gcc in a process
group with id 2.
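
To make the collision concrete, here is a simplified model in Python. It
is not the phobos code: the class and function names are hypothetical,
fake_kill merely records its argument instead of signalling anything,
and the guard shown is one possible way to avoid the problem, not
necessarily what the forwarded patch does.

```python
class Pid:
    """Simplified model of a process handle with a sentinel value."""

    TERMINATED = -2  # sentinel stored once the child has been reaped

    def __init__(self, process_id):
        self.process_id = process_id

    def try_wait(self):
        # Model only: pretend the child has exited and overwrite the
        # stored id with the sentinel.
        self.process_id = Pid.TERMINATED


def fake_kill(pid, calls):
    """Stand-in for kill(2): records the pid, sends no signal."""
    calls.append(pid)


def kill_unguarded(handle, calls):
    # Passes whatever is stored, including the sentinel.
    fake_kill(handle.process_id, calls)


def kill_guarded(handle, calls):
    # Refuse to forward sentinel (negative) values to kill.
    if handle.process_id < 0:
        raise ValueError("process already terminated or invalid")
    fake_kill(handle.process_id, calls)


if __name__ == "__main__":
    calls = []
    handle = Pid(1234)
    handle.try_wait()
    kill_unguarded(handle, calls)
    print(calls)  # the sentinel reached kill as a process group id
```

In the unguarded variant the sentinel is handed to kill unchanged, where
it is read as "process group 2".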

In any case, I think 2 is a valid process group id and sbuild is
entitled to use it. This is not an sbuild bug, but a regular build
failure. Once identified, locating the broken test case was manageable
and a patch has been forwarded to the phobos repository. I filed it
there, because phobos changes are synced into gcc.git rather than
committed directly.

I hope that this fully settles the matter. Whilst I am writing this
down, this very much is joint work of Emanuele, Jochen and me. It is a
result of circulating ideas, diagnostics and patches between us. I guess
that none of us would have found the cause today on our own.

Thank you for sharing this adventure

Helmut
