Hi Simon,

Thanks for having taken the time to do another extensive writeup. Much
appreciated.
On Wed, Jun 26, 2024 at 06:11:09PM +0100, Simon McVittie wrote:
> On Tue, 25 Jun 2024 at 18:55:45 +0200, Helmut Grohne wrote:
> > The main difference to how everyone else does this is that in a typical
> > sbuild interaction it will create a new user namespace for every single
> > command run as part of the session. sbuild issues tens of commands
> > before launching dpkg-buildpackage and each of them creates new
> > namespaces in the Linux kernel (all of them using the same uid mappings,
> > performing the same bind mounts and so on). The most common way to think
> > of containers is different: You create those namespaces once and reuse
> > the same namespace kernel objects for multiple commands part of the same
> > session (e.g. installation of build dependencies and dpkg-buildpackage).
>
> Yes. My concern here is that there might be non-obvious reasons why
> everyone else is doing this the other way, which could lead to behavioural
> differences between unschroot and all the others that will come back to
> bite us later.

I do not share this concern (though I do share other concerns of
yours). The risk of behavioural differences is fairly low, because we
do not expect any non-filesystem state to transition from one command
to the next. Much to the contrary, the use of a pid namespace for each
command ensures reliable process cleanup, so no background processes
can accidentally stick around.

I am concerned about behavioural differences due to the
reimplementation-from-first-principles aspect though. Jochen and
Aurelien will know more here, but I think we had a fair number of FTBFS
due to such differences. None of them was due to the architecture of
creating namespaces for each command; most of them were due to not
having gotten containers right in general. Some were broken packages,
e.g. ones skipping tests when detecting schroot.

Also note that my not sharing your concern here does not imply that I
favour sticking to that architecture. I expressed elsewhere that I see
benefits in changing it for other reasons. At this point I see this
more and more as a non-boolean question. There is a spectrum between
"create namespaces once and use them for the entire session" and
"create new namespaces for each command", and I am starting to believe
that what would be best for sbuild lies somewhere in between.

> For whole-system containers running an OS image from init upwards,
> or for virtual machines, using ssh as the IPC mechanism seems
> pragmatic. Recent versions of systemd can even be given a ssh public
> key via the systemd.system-credentials(7) mechanism (e.g. on the kernel
> command line) to set it up to be accepted for root logins, which avoids
> needing to do this setup in cloud-init, autopkgtest's setup-testbed,
> or similar.

Another excursion: systemd goes beyond this and also exposes the ssh
port via AF_VSOCK (in the case of VMs) or via a unix domain socket on
the outside (in the case of containers) to make safe discovery of the
ssh access easier.

> For "application" containers like the ones you would presumably want
> to be using for sbuild, presumably something non-ssh is desirable.

I partially concur, but this goes into the larger story I hinted at in
my initial mail. If we move beyond containers and look into building
inside a VM (e.g. sbuild-qemu), we are in a difficult spot, because we
need e.g. systemd for booting, but we may not want it in our build
environment.
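(To finish the ssh excursion with something concrete: my understanding
is that the credentials route looks roughly like the following when the
VM is started with qemu directly. This is untested and from memory; the
key and image are placeholders, and the credential name is the one I
believe systemd.system-credentials(7) documents.)

    # Hand an authorized key for root into the guest as a systemd
    # credential via an SMBIOS OEM string; recent systemd turns the
    # credential "ssh.authorized_keys.root" into an authorized_keys
    # entry for root, so no cloud-init/setup-testbed step is needed.
    qemu-system-x86_64 -m 2G -drive file=sid.img,format=qcow2 \
        -smbios type=11,value="io.systemd.credential:ssh.authorized_keys.root=ssh-ed25519 AAAA...placeholder" \
        ...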
So long term, I think sbuild will have to differentiate between three
contexts:

 * the system it is being run on,
 * the containment or virtualisation environment used to perform the
   build, and
 * the system where the build is being performed inside the containment
   or virtualisation environment.

At present, sbuild does not distinguish the latter two and always
treats them as equal. When building inside a VM, we may eventually want
to create a chroot inside the VM to arrive at a minimal environment.
The same technique is applicable to system containers. When doing this,
we minimize the build environment and do not mind the extra ssh
dependency in the container or virtualisation environment. For now
though, this is all wishful thinking. As long as this distinction does
not exist, we pretty much want minimal application containers for
building, as you said.

> If you build an image by importing a tarball that you have built in
> whatever way you prefer, minimally something like this:
>
> $ cat > Dockerfile <<EOF
> FROM scratch
> ADD minbase.tar.gz /
> EOF
> $ podman build -f Dockerfile -t local-debian:sid .

I don't quite understand the need for a Dockerfile here. I suspect that
this is the obvious way that works reliably, but my impression was that
using podman import would be easier. I had success with this:

    mmdebstrap --format=tar --variant=apt unstable - |
        podman import --change CMD=/bin/bash - local-debian/sid

> then you should be able to use localhost/local-debian:sid
> as a substitute for debian:sid in the examples given in
> autopkgtest-virt-podman(1), either using it as-is for testing:
>
> $ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

This did not work for me. autopkgtest failed to create a user account.
I suspect that this has one of two reasons: either autopkgtest expects
python3 to be installed and it isn't, or it expects passwd to be
installed and doesn't install it when missing (as passwd is
non-essential). A quick way to test that guess is sketched below.

> or making an image that has been pre-prepared with some essentials like
> dpkg-source, and testing in that:
>
> $ autopkgtest-build-podman --image localhost/local-debian:sid
> ...
> Successfully tagged localhost/autopkgtest/localhost/local-debian:sid

Works for me.

> $ autopkgtest hello*.dsc -- podman autopkgtest/localhost/local-debian:sid
> (tests run)

Thank you very much. I got this working for application container based
testing, which provides a significant speedup compared to virt-qemu. I
am more interested in providing isolation-container though, as a number
of tests require that, and I currently tend to resort to virt-qemu for
that. Sure enough, adding --init=systemd to autopkgtest-build-podman
just works, and a system container can also be used as an application
container by autopkgtest (so there is no need to build both), but
running autopkgtest-virt-podman with --init also fails here in
non-obvious ways. It appears that user creation was successful, but the
user creation script is still printed in red. We're now deep into
debugging specific problems in the autopkgtest/podman integration and
this is probably getting off-topic for d-devel. Is the evidence thus
far sufficient for turning this part of the discussion into a bug
report against autopkgtest?
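Coming back to the user creation failure above: if the guess about
passwd or python3 being missing is right, seeding them into the tarball
should be enough to confirm it, e.g. (untested):

    mmdebstrap --format=tar --variant=apt --include=passwd,python3 unstable - |
        podman import --change CMD=/bin/bash - local-debian/sid

If autopkgtest then succeeds where the plain image failed, we know
which missing package to blame.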
> Adding a mode for "start from this pre-prepared minbase tarball" to all
> of the autopkgtest-build-* tools (so that they don't all need to know
> how to run debootstrap/mmdebstrap from first principles, and then duplicate
> the necessary options to make it do the right thing), has been on my
> to-do list for literally years. Maybe one day I will get there.

From my point of view, this isn't actually necessary. I expect that
many people would be fine drawing images from a container registry.
Stubborn people like me will happily go the extra mile.

> We could certainly also benefit from some syntactic sugar to make the
> automatic choice of an image name for localhost/* podman images nicer,
> with fewer repetitions of localhost/.

Let me pose a possibly stupid suggestion. Much of the time when people
interact with autopkgtest, there is a very limited set of backends and
backend options they use frequently. Rather than making the options
shorter, how about introducing an aliasing mechanism? Say I could have
some ~/.config/autopkgtest.conf, and whenever I run
"autopkgtest ... -- $BACKEND" such that there is no
autopkgtest-virt-$BACKEND, autopkgtest would consult that configuration
file and, if a value is assigned for $BACKEND there, expand it to the
assigned value. Then I could just record my commonly used backends and
options there and refer to them by memorable names of my own liking.
Automatic choice of images makes things more magic, which has negative
aspects as well.

> podman is unlikely to provide you with a way to generate a minbase
> tarball without first creating or downloading some sort of container
> image in which you can run debootstrap or mmdebstrap, because you have
> to be able to start from somewhere. But you can run mmdebstrap unprivileged
> in unshare mode, so that's enough to get you that starting point.

I consider this part of the problem space fully solved.

Please allow for another podman question (more people than just Simon
may know the answer). Every time I run a podman container (e.g. when I
run autopkgtest), my ~/.local/share/containers grows. I think
autopkgtest manages to clean up in the end, but e.g. "podman run -it
..." seems to leave stuff behind. Such a growing directory is
problematic for multiple reasons, but I was also hoping that podman
would use fuse-overlayfs + tmpfs to run my containers instead of
writing tons of stuff to my slow disk. I hoped --image-volume=tmpfs
could improve this, but it did not. Of course, when I skip podman's
image management and use --rootfs, I can sidestep this problem by
choosing my root location on a tmpfs, but that's not how autopkgtest
uses podman.

> > We learned that sbuild --chroot-mode=unshare and unschroot spawn
> > a new set of namespaces for every command. What you point out as a
> > limitation also is a feature. Technically, it is a lie that the
> > namespaces are always constructed in the same way. During installation
> > of build depends the network namespace is not unshared while package
> > builds commonly use an unshared network namespace with no interfaces but
> > the loopback interface.
>
> I don't think podman can do this within a single run. It might be feasible
> to do the setup (installing build-dependencies) with networking enabled;
> leave the root filesystem of that container intact; and reuse it as the
> root filesystem of the container in which the actual build runs, this time
> with --network=none?
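Spelled out, I imagine this roughly as follows (a sketch only; the
volume and uid mapping details would surely need tuning, and I assume
the unpacked source tree sits in the current directory):

    root=$(mktemp -d)                     # ideally placed on a tmpfs
    mmdebstrap --variant=apt --include=build-essential unstable "$root"
    # first container: install build dependencies, network still available
    podman run --rm -v "$PWD":/src -w /src --rootfs "$root" \
        sh -c 'apt-get update && apt-get --yes build-dep ./'
    # second container: same root filesystem, but no network for the build
    podman run --rm --network=none -v "$PWD":/src -w /src --rootfs "$root" \
        dpkg-buildpackage -us -uc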
Do I understand correctly that in this variant, you intend to use
podman without its image management capabilities and rather just use
--rootfs, spawning two podman containers on the same --rootfs (one
after another), where the first one installs dependencies and the
second one isolates the network for building?

> Or the "install build-dependencies" step (and other setup) could perhaps
> even be represented as a `podman build` (with a Dockerfile/Containerfile,
> FROM the image you had as your starting point), outputting a temporary
> container image, in which the actual dpkg-buildpackage step can be invoked
> by `podman run --network=none --rmi`?

In this case, we build a complete container image for the purpose of
building a package. This has interesting consequences. For one thing,
we often build the same package twice, so caching such an image for
some time is an obvious feature to look into. If you go that way, you
may as well use mmdebstrap to construct containers with precisely the
relevant build dependencies on demand (for every build). The
mmdebstrap ... | podman import ... rune would roughly work for that.

Let me try to take one step back here. The podman model (and that of
many other runtimes) is that one session equates to one set of
namespaces, but network isolation requires another set of namespaces.
Your two approaches cleverly side-step this, either by running two
containers on the same directory hierarchy, or by constructing
containers on demand (in one set of namespaces) and running them (in
another).

These approaches come with limitations. The first approach requires
reinventing podman's image management and doing that by hand. In
particular, that prohibits us from using overlays as a means to avoid
extraction, or doing the extraction on demand via e.g. squashfs. In an
ideal world, I think we want one user and mount namespace for the
entire session and then pid and network namespaces per command as
needed. The second approach requires writing the container image to
disk, very much degrading build performance.

If we want to enable these use cases, then I fear podman is not the
tool of choice, as its feature set does not match these (idealized)
requirements. In other words, settling on podman limits what features
we can implement in sbuild, but it may still allow more features than
the status quo, so it can still be an incremental improvement of the
status quo. The question kinda becomes whether it is reasonable to skip
that podman step and head over to an architecture that enables more of
our use cases. And then the question becomes whether unschroot is that
better architecture or not, and whether the additional features we
expect from it are worth the risk of maintenance issues that you
correctly identified.

Helmut
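PS: For reference, the on-demand mmdebstrap variant I have in mind
would be roughly the following (untested and from memory; hello and its
version are just an example source package):

    # build a throwaway image that already contains the build
    # dependencies of exactly one source package
    mmdebstrap --format=tar --variant=apt --include=build-essential \
        --customize-hook='copy-in hello_2.10-3.dsc /tmp' \
        --customize-hook='chroot "$1" apt-get --yes build-dep /tmp/hello_2.10-3.dsc' \
        unstable - |
      podman import --change CMD=/bin/bash - local-debian/sid-build-hello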