Control: reassign -1 src:gcc-14
Control: affects -1 = src:gcc-13 sbuild
Control: tags -1 + ftbfs patch upstream
Control: forwarded -1 https://github.com/dlang/phobos/pull/10586
Control: severity -1 serious

Hi Matthias, Emanuele and Jochen,

On Wed, Dec 04, 2024 at 08:05:58AM +0100, Emanuele Rocca wrote:
> Starting with version 0.86.0, sbuild using the unshare backend does not
> build gcc packages successfully, at least on arm64 and amd64. The
> version immediately before that, 0.85.11, works. Similar failures have
> been seen on other architectures, I think. Both gcc-13 and gcc-14 are
> affected, and probably others as well.
>
> The following logs show sbuild 0.85.11 successfully building gcc-14:
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T13:36:47Z.build
>
> The issue occurs when running tests. To try and reduce the search space
> in terms of where the problem may be, as well as reduce build times a
> little bit, I've tried to single out one language for which the build
> fails. Such language seems to be D.
>
> DEB_BUILD_OPTIONS=nolang=ada,go,c,c++,fortran,objc,obj-c++,m2,rust
>
> The above DEB_BUILD_OPTIONS results in gcc being built with:
>
> --enable-languages=c,d
>
> This is 0.85.0 building gcc-14 correctly. I haven't tried 0.85.11, but
> it's very likely going to work as well.
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T20:26:06Z.build
>
> The following logs show sbuild 0.86.0 failing to build gcc-14, all other
> things being equal (including --enable-languages=c,d).
> https://people.debian.org/~ema/gcc-14_14.2.0-9_arm64-2024-12-03T18:58:58Z.build

That is already a lot of useful clues. Before I started looking into
this, Emanuele and Jochen had already figured out that replacing sbuild's
init written in perl with dumb-init (which it used earlier) made the
build work, so that is the interesting change. Today the three of us met
virtually and debugged the issue further. Eventually, Emanuele extracted
the relevant process.exe test case from the gcc-14 build, which really
sped up further debugging.
He managed to produce a full strace of running it inside
sbuild-usernsexec and, thanks to Jochen, inside a variant patched to
revert to dumb-init. It failed in the former and worked in the latter.
Studying those straces is like searching for a needle in a haystack, but
eventually we found a difference. Both of them were issuing:

    kill(-2, SIGTERM)

In the succeeding test, this syscall would return -ESRCH. In the failing
one, it would succeed. The interesting part is what is being killed
here. A negative pid (other than -1) identifies a process group. When
running dumb-init, there would be a few forks before launching
dpkg-buildpackage, so the process group id of dpkg-buildpackage would
end up being 16. In the perl implementation, those forks were elided, so
dpkg-buildpackage was running as process group id 2. Due to the use of a
pid namespace, it would reliably end up being 2. So what is being killed
here is the entire build.

Further digging into std/process.d revealed that the Pid class uses the
constant -2 to mean "terminated". The tryWait function changes the
processID value from the original value to -2 and the subsequent call to
kill then receives it. It seems that few people build gcc in a process
group with id 2. In any case, I think 2 is a valid process group id and
sbuild is entitled to use it. This is not an sbuild bug, but a regular
build failure.

Once identified, locating the broken test case was manageable and a
patch has been forwarded to the phobos repository. I filed it there,
because phobos changes are synced into gcc.git rather than committed
directly. I hope that this fully settles the matter.

Whilst I am writing this down, this very much is joint work of Emanuele,
Jochen and me. It is a result of circulating ideas, diagnostics and
patches between us. I guess that none of us would have found the cause
today on their own.

Thank you for sharing this adventure

Helmut
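
PS: For readers unfamiliar with the kill(2) convention involved, here is
a minimal Python sketch of it. It is not taken from phobos or sbuild, and
the function name kill_group_demo is made up for illustration. It forks
a child into its own fresh process group, signals that group via a
negative pid, and then repeats the call once the group is gone, which
should give the ESRCH outcome seen in the succeeding strace. It assumes
a POSIX system.

```python
import os
import signal
import time


def kill_group_demo():
    """Return (signal that ended the child, True if the second kill hit ESRCH)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a new process group, report, then idle.
        try:
            os.close(r)
            os.setpgid(0, 0)
            os.write(w, b"x")
            time.sleep(30)
        finally:
            os._exit(0)

    # Parent: wait until the child has moved into its own group.
    os.close(w)
    os.read(r, 1)
    os.close(r)
    pgid = os.getpgid(pid)

    # A negative first argument addresses the whole process group.
    os.kill(-pgid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    termsig = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None

    # The group no longer exists, so the same call now fails with ESRCH,
    # which Python reports as ProcessLookupError.
    try:
        os.kill(-pgid, signal.SIGTERM)
        got_esrch = False
    except ProcessLookupError:
        got_esrch = True
    return termsig, got_esrch


if __name__ == "__main__":
    print(kill_group_demo())
```

The same mechanism explains both straces: with no process group 2
around, kill(-2, SIGTERM) is a harmless ESRCH; with the build itself
running as process group 2, the signal is delivered to it.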