Hi Sam and others,

On Fri, Jun 28, 2024 at 07:08:20AM -0600, Sam Hartman wrote:
> I'll be honest, I think building a new container backend makes no sense
> at all.
I looked hard at this as it was voiced by many. I have to say, I remain
unconvinced of the arguments brought forward.

> There's a lot of work that has gone into systemd-nspawn, podman, docker,
> crun, runc, and the related ecosystems.

I consider myself an expert user of systemd-nspawn. One thing that it
really lacks on bookworm is unprivileged execution. If you run your
builds as root, there is debspawn. In the future, systemd-nspawn shall
work unprivileged - but only if your image is dm-verity signed. Bummer.
I do not see it as meeting our technical requirements in any way.

podman is a much more sensible suggestion and Simon gave a lot of
feedback on how to integrate it. Still, its architecture is limiting in
multiple central aspects. For one thing, podman works with a static set
of namespaces per container instance, but what we want here is to use
different network namespaces for installing build-depends and for
performing a build. Another aspect is that people are already
complaining about the tarball-unpack approach taken by sbuild
--chroot-mode=unshare being slow. podman will make it slower still, as
it requires the unpack to happen inside the user's $HOME. My initial
experiments indicate that we are in for a factor-of-two slowdown,
whereas we could reduce this significantly with an overlayfs approach
that we cannot shoehorn into podman. podman upstream insists that
CAP_SYS_ADMIN is a no-go while systemd upstream insists that
CAP_SYS_ADMIN is a requirement. Whilst this is fine for building, we
also want to run autopkgtests. Running podman also requires a
systemd-logind session - something that is not usually available on a
buildd, in an application container (where you may also want to build a
package) or when you su/sudo to a different user. My conclusion is that
morphing podman into something usable is more work than writing a
container runtime, and that does not even account for the political
disagreements involved.
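To make the namespace point concrete, here is a minimal sketch of the
kind of interface a runtime would need: the same container session
running one phase with network access and one without. The class and
method names (`BuildSession`, `run(network=...)`) are invented for
illustration and are not unschroot's actual API; the real isolation
would happen via unshare(CLONE_NEWNET) in the child, which is only
noted in a comment here.

```python
import subprocess

class BuildSession:
    """Hypothetical per-build container session (illustrative only).

    Unlike podman, which fixes one set of namespaces per container
    instance, each phase here picks its own network namespace:
    installing build-depends needs the network, the build itself
    should not see it.
    """

    def __init__(self, name):
        self.name = name
        self.log = []

    def run(self, argv, *, network):
        # In a real runtime the child would call unshare(CLONE_NEWNET)
        # when network=False, so the build phase cannot reach the
        # network even though the install phase could.
        phase = "networked" if network else "isolated"
        self.log.append((phase, argv))
        return subprocess.run(argv, capture_output=True, text=True)

session = BuildSession("hello-build")
# Phase 1: installing build-depends wants the network.
session.run(["echo", "apt-get build-dep ."], network=True)
# Phase 2: the build itself should run without it.
result = session.run(["echo", "dpkg-buildpackage -us -uc"], network=False)
print([phase for phase, _ in session.log])
```

The point is that `network=` varies per call on one session, which a
static-namespace design cannot express.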
Let me skip docker as it is very similar to podman in all of the aspects
above.

Then you mention crun and runc. These are vaguely API-compatible and
they are the lower-level building blocks of both podman and docker. The
issue about CAP_SYS_ADMIN mentioned for podman earlier can be resolved
with ease at this level (at the cost of having containers that do not
contain, which is the reason podman refuses to do this). The earlier
note about network namespaces fully applies here though. By going down
to this level, we also lose quite a bit of the image-management benefits
that the podman level included.

Your mention of related tools probably includes slirp4netns, passt,
uidmap and others. Tools at this level do not interfere with our
requirements and as such I fully concur with reusing them.

Beyond all of this, I take issue with a fundamental design decision
shared by all the mentioned container runtimes. They all have an
architecture that allows an outside process to "join" a container
(podman exec). Whilst that is a useful feature, it uses the setuid
approach to privilege transitions that we have learned over the years
to be inherently vulnerable and that the systemd folks have been
working hard on replacing with IPC mechanisms. As far as I understand
it, a significant portion of container runtime escapes work by
exploiting this joining architecture and the involuntary acquisition of
host resources into a container. If this were implemented via IPC, we
could side-step an entire class of vulnerabilities.

> I think an approach that allowed sbuild to actually use a real container
> backend would be long-term more maintainable and would allow Debian's
> DevOps practices to better align with the rest of the world.

I have a hard time agreeing with this.
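A toy sketch of the IPC alternative, to show the shape of the idea: a
leader process is the only thing that ever spawns children inside the
container, and an outside "exec" merely sends a request over a socket
instead of joining the namespaces itself. Everything here (the protocol,
the NUL-separated argv encoding) is made up for illustration; a real
runtime would have called unshare() in the leader at startup so that
children inherit its namespaces.

```python
import os
import socket
import subprocess

def leader(sock):
    # Container leader: the only process that spawns children "inside".
    # In a real runtime it would have unshared its namespaces at
    # startup; children inherit them by plain fork/exec.  An outside
    # exec never calls setns() - it just asks over IPC.
    while True:
        request = sock.recv(4096)
        if not request:
            break
        argv = request.decode().split("\0")
        result = subprocess.run(argv, capture_output=True)
        sock.sendall(result.stdout or b"(no output)")

parent_sock, child_sock = socket.socketpair()
pid = os.fork()
if pid == 0:
    parent_sock.close()
    leader(child_sock)
    os._exit(0)
child_sock.close()

# Client side: request execution instead of joining the container.
parent_sock.sendall("\0".join(["echo", "hello from inside"]).encode())
output = parent_sock.recv(4096)
parent_sock.close()
os.waitpid(pid, 0)
print(output.decode())
```

Because the client never acquires the container's namespaces, the class
of escapes based on dragging host resources into the container through
a joining process simply does not arise in this design.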
I have been using rootless containers since long before podman
supported them, and I still feel very limited whenever I am supposed to
use podman; I prefer resorting to other tools that are more capable and
performant.

> I have some work I've been doing in this space which won't be useful to
> you because it is not built on top of sbuild.
> (Although I'd be happy to share under LGPL-3 for anyone interested.)

You can. I'm not sure we'll have to stick to sbuild. If we end up
converting our official buildds to something else, so be it. However,
I'd like to get to a point where building packages just works in a way
that doesn't require root privileges by default. We don't have this "it
just works" experience now.

> But I find that I disagree with the idea of writing a new container
> runtime for sbuild so strongly that I can no longer use sbuild for
> Debian work, so I started working on my own package building solution.

Please bear in mind that, effectively, sbuild has gained its own
container runtime already and that what I am looking into here is
extracting it into a separate package interfacing with sbuild. I would
therefore rephrase it as refactoring a container runtime rather than
writing a new one. If there were an alternative to sbuild that allowed
unprivileged package building in a sane way, I'd readily switch over
and stop bothering with all of this. The problem is that they are all
vaporware, while unschroot has barely reached feature parity with
sbuild --chroot-mode=unshare.

> In terms of constructive feedback:
>
> * I think your intuition that sbuild --chroot=unshare is limiting is
>   good.

At least something we agree on. :)

> * I would move toward a persistent namespace approach because it is
>   more similar to broadly used container backends.

I agree, and I agree with the reason you give. I have reached the
conclusion that doing a persistent namespace requires a background
process and an IPC mechanism. (This requirement rules out
podman/docker/crun/runc.)
> * overlayfs/fuse-overlayfs are how the rest of the world is solving
>   these problems (or snapshots and the like). Directories are kind of a
>   Debian-specific artifact that I find more and more awkward to deal with
>   as the rest of my work uses containers for CI/CD.

I don't think this is fully accurate. In particular, podman performs
extraction for every container instantiation and thus requires a lot of
storage in $HOME. I agree that overlayfs is preferable, but
unfortunately, this is not how podman works. In any case, the important
piece here is not whether to use directories or overlayfs (mind the
performance difference) but hiding the storage backend behind an
abstraction that enables a user not to think about it. And that is
really what podman and docker (but not runc and crun) do.

So the more I have let this settle and experimented with podman and
friends, the more I am reaching the conclusion that none of the
existing container runtimes provide an architecture that meets the
requirements I would like to see met. While the unschroot approach does
not provide persistent namespaces at this time, it demonstrates that we
technically can plug a container runtime into sbuild. It is not so much
that I have to write my own (I'd rather prefer not to), but that I am
trying to plug something into sbuild and make it work practically. And
that's not because sbuild would be the best tool around, but because so
many other tools integrate with sbuild that it has effectively become a
really complex API that I don't want to reimplement. I tried plugging
in podman and that just wouldn't fit, so I'll continue looking for
other options despite everyone else telling me that this is a bad idea.
Maintaining a container runtime is hard, because maintaining a code
base of a hundred thousand lines is hard. What if you merely needed a
few thousand? Thus far, unschroot has about three hundred lines (plus
libraries). This also hints that podman solves a lot of problems - just
not the ones we face.
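To illustrate the "hide the storage backend behind an abstraction"
point, here is a sketch with invented names (`StorageBackend`,
`DirectoryBackend`, `OverlayBackend` - none of this is unschroot's or
podman's actual code). The directory backend is runnable; the overlayfs
one is deliberately left as a stub because mounting needs privileges.

```python
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    """Hypothetical interface: callers get a writable root filesystem
    and never learn whether it is a copied directory or an overlay."""

    @abstractmethod
    def instantiate(self, base: Path) -> Path:
        """Produce a writable root filesystem from a base image."""

    @abstractmethod
    def discard(self, root: Path) -> None:
        """Throw away a build's modifications."""

class DirectoryBackend(StorageBackend):
    # Copy the whole tree: simple, but pays the full extraction cost
    # on every instantiation - the slowness people complain about.
    def instantiate(self, base):
        root = Path(tempfile.mkdtemp(prefix="build-root-"))
        shutil.copytree(base, root, dirs_exist_ok=True)
        return root

    def discard(self, root):
        shutil.rmtree(root)

class OverlayBackend(StorageBackend):
    # Sketch only: a real implementation would mount overlayfs with
    # lowerdir=base and a fresh upperdir, making instantiation cheap
    # regardless of chroot size.
    def instantiate(self, base):
        raise NotImplementedError("requires mount privileges")

    def discard(self, root):
        raise NotImplementedError

# Callers never know which backend they got.
base = Path(tempfile.mkdtemp(prefix="base-image-"))
(base / "etc").mkdir()
(base / "etc" / "os-release").write_text("ID=debian\n")

backend: StorageBackend = DirectoryBackend()
root = backend.instantiate(base)
content = (root / "etc" / "os-release").read_text()
print(content)
backend.discard(root)
shutil.rmtree(base)
```

Swapping `DirectoryBackend` for `OverlayBackend` would then be a pure
implementation detail, which is the property worth preserving.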
And since we disagreed so much about container runtimes: I think the
end goal is not building in a container, but building inside a KVM
guest, as that provides far better isolation between guest and host.
One step at a time.

Helmut