Hi Sam and others,

On Fri, Jun 28, 2024 at 07:08:20AM -0600, Sam Hartman wrote:
> I'll be honest, I think building a new container backend makes no sense
> at all.

I looked hard at this, as it was voiced by many. I have to say, I remain
unconvinced by the arguments brought forward.

> There's a lot of work that has gone into systemd-nspawn, podman, docker,
> crun, runc, and the related ecosystems.

I consider myself an expert user of systemd-nspawn. One thing it really
lacks on bookworm is unprivileged execution. If you run your builds as
root, there is debspawn. In the future, systemd-nspawn will work
unprivileged - but only if your image is dm-verity signed. Bummer. I do
not see it meeting our technical requirements in any way.

podman is a much more sensible suggestion, and Simon gave a lot of
feedback on how to integrate it. Still, its architecture is limiting in
multiple central aspects. For one thing, podman works with a static set
of namespaces per container instance, but what we want here is to use
different network namespaces for installing build-depends and for
performing a build. Another aspect is that people are already
complaining about the tarball-unpack approach taken by sbuild
--chroot-mode=unshare being slow. podman will make it slower still,
because it requires the unpack to happen inside the user's $HOME. My
initial experiments indicate that we are in for a factor of two, whereas
we could reduce this significantly with an overlayfs approach that we
cannot shoehorn into podman. podman upstream insists on CAP_SYS_ADMIN
being a no-go while systemd upstream insists on CAP_SYS_ADMIN being a
requirement. Whilst this is fine for building, we also want to run
autopkgtests. Running podman also requires a systemd-logind session -
something that is not usually available on a buildd, in an application
container (where you may also want to build a package), or when you
su/sudo to a different user. My conclusion is that morphing podman into
something usable is more work than writing a container runtime, and that
does not even account for the political disagreements involved.

Let me skip docker as it is very similar to podman in all of the aspects
above.

Then you mention crun and runc. These are vaguely API-compatible, and
they are the lower-level building blocks of both podman and docker. The
CAP_SYS_ADMIN issue mentioned for podman earlier can be resolved with
ease at this level (at the cost of having containers that do not
contain, which is the reason podman refuses to do this). The earlier
note about network namespaces fully applies here, though. By going down
to this level, we also lose quite a bit of the image management benefits
that the podman level included.

Your vague mention of related tools probably includes slirp4netns,
passt, uidmap and others. Tools at this level do not conflict with our
requirements, and as such I fully concur with reusing them.

Beyond all of this, I take issue with a fundamental design decision
shared by all the mentioned container runtimes. They all have an
architecture that allows an outside process to "join" a container
(podman exec). Whilst that is a useful feature, it uses the setuid
approach to privilege transitions that we have learned over the years to
be inherently vulnerable and that the systemd folks have been working
hard to replace with IPC mechanisms. As far as I understand it, a
significant portion of container runtime escapes work by exploiting this
joining architecture and the involuntary acquisition of host resources
into a container. If this were implemented via IPC, we could sidestep an
entire class of vulnerabilities.
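To make the distinction concrete, here is a minimal sketch of the IPC
alternative. This is purely illustrative and not unschroot's actual
protocol: instead of the client entering the container's namespaces
("joining"), it sends the command over a socket and a supervisor that is
already inside the container runs it, so no host resources ever cross
the boundary.

```python
import json
import os
import socket
import subprocess

def supervisor(sock: socket.socket) -> None:
    """Stands in for a process inside the container; runs IPC requests."""
    request = json.loads(sock.recv(65536))
    result = subprocess.run(request["argv"], capture_output=True, text=True)
    sock.sendall(json.dumps({"stdout": result.stdout,
                             "returncode": result.returncode}).encode())

def client(sock: socket.socket, argv: list) -> dict:
    """Runs on the host; never acquires the container's namespaces."""
    sock.sendall(json.dumps({"argv": argv}).encode())
    return json.loads(sock.recv(65536))

parent, child = socket.socketpair()
if os.fork() == 0:          # child plays the in-container supervisor
    parent.close()
    supervisor(child)
    os._exit(0)
child.close()
reply = client(parent, ["echo", "hello"])
print(reply["stdout"].strip())   # hello
print(reply["returncode"])       # 0
```

Compare this with podman exec or nsenter, where the joining process
carries open file descriptors, its credentials, and other host state
into the container - exactly the surface the escapes mentioned above
exploit.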

> I think an approach that allowed sbuild to actually use a real container
> backend would be long-term more maintainable and would allow Debian's
> DevOps practices to better align with the rest of the world.

I have a hard time agreeing with this. I have been using rootless
containers for far longer than podman has supported them, and I still
feel very limited whenever I am supposed to use podman, preferring to
resort to other tools that are more capable and performant.

> I have some work I've been doing in this space which won't be useful to
> you because it is not built on top of sbuild.
> (Although I'd be happy to share under LGPL-3 for anyone interested.)

You can. I'm not sure we'll have to stick to sbuild. If we end up
converting our official buildds to something else, so be it. However,
I'd like to get to a point where building packages just works in a way
that doesn't require root privileges by default. We don't have this "it
just works" experience now.

> But I find that I disagree with the idea of writing a new container
> runtime for sbuild so strongly that I can no longer use sbuild for
> Debian work, so I started working on my own package building solution.

Please bear in mind that effectively, sbuild has gained its own
container runtime already and that what I am looking into here is
extracting it into a separate package interfacing with sbuild. I would
therefore rephrase it as refactoring a container runtime rather than
writing a new one.

Then, if there were an alternative to sbuild that allowed unprivileged
package building in a sane way, I'd readily switch over and stop
bothering with all of this. The problem is that they are all vaporware,
while unschroot has barely reached feature parity with sbuild
--chroot-mode=unshare.

> In terms of constructive feedback:
> 
> * I think your intuition that sbuild --chroot=unshare is limiting is
>   good.

At least something we agree on. :)

> * I would move toward a persistent namespace approach  because it is
>   more similar to broadly used container backends.

I agree and agree with the reason you give. I have reached the
conclusion that doing a persistent namespace requires a background
process and an IPC mechanism. (This requirement rules out
podman/docker/crun/runc.)
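The reason a background process is needed can be seen from how the
kernel exposes namespaces: a namespace lives only as long as something
holds a reference to it - a member process or a bind-mounted
/proc/&lt;pid&gt;/ns/* file. A persistent container therefore needs a
long-lived holder process the frontend talks to over IPC. A minimal,
unprivileged demonstration of those handles (illustrative only):

```python
import os

pid = os.getpid()
# Each /proc/<pid>/ns entry is a handle on a namespace.  Keeping one of
# these files open (or keeping a process inside the namespace) keeps the
# namespace alive; once the last reference goes away, so does the
# namespace.  That is why a "persistent" container needs a holder.
for ns in ("net", "mnt", "pid"):
    print(ns, os.readlink(f"/proc/{pid}/ns/{ns}"))
```

The links read like net:[4026531840]; the bracketed number is the
namespace's inode, which is how tools such as nsenter identify it.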

> * overlayfs/fuse-overlayfs are how the rest of the world is solving
>   these problems (or snapshots and the like).  Directories are kind of a
>   Debian-specific artifact that I find more and more awward to deal with
>   as the rest of my work uses containers for CI/CD.

I don't think this is fully accurate. In particular, podman performs an
extraction for every container instantiation and thus requires a lot of
storage in $HOME. I agree that overlayfs is preferable, but
unfortunately, this is not how podman works. In any case, the important
piece here is not whether to use directories or overlayfs (mind the
performance difference) but hiding the storage backend behind an
abstraction that frees the user from having to think about it. And that
is really what podman and docker (but not runc and crun) do.
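A minimal sketch of the abstraction I am arguing for, with illustrative
names (this is not unschroot's actual API): callers ask for a writable
build tree and never learn whether it came from a tarball unpack, a
directory copy, or an overlayfs mount.

```python
import abc
import shutil
import tempfile
from pathlib import Path

class StorageBackend(abc.ABC):
    @abc.abstractmethod
    def instantiate(self, base: Path) -> Path:
        """Return a writable root derived from the pristine base tree."""

class CopyBackend(StorageBackend):
    """Directory copy: simple, but pays the full unpack cost per build."""
    def instantiate(self, base: Path) -> Path:
        work = Path(tempfile.mkdtemp(prefix="build-"))
        shutil.copytree(base, work / "root")
        return work / "root"

# An OverlayBackend would instead mount base as the read-only lower
# layer with a fresh upper/work pair on top: near-constant-time setup,
# but it needs CAP_SYS_ADMIN or unprivileged overlayfs inside a user
# namespace - exactly the podman/systemd disagreement described earlier.

base = Path(tempfile.mkdtemp(prefix="base-"))
(base / "etc").mkdir()
(base / "etc" / "hostname").write_text("pristine\n")

root = CopyBackend().instantiate(base)
(root / "etc" / "hostname").write_text("modified\n")    # build scribbles freely
print((base / "etc" / "hostname").read_text().strip())  # pristine
```

Swapping the backend then becomes a policy decision invisible to the
build, which is the property podman and docker get right.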

So the more I have let this settle and experimented with podman and
friends, the more I reach the conclusion that none of the existing
container runtimes provide an architecture that meets the requirements I
would like to see met. While the unschroot approach does not provide
persistent namespaces at this time, it demonstrates that we technically
can plug a container runtime into sbuild. It is not so much that I have
to write my own (I'd prefer not to) as trying to plug something into
sbuild and make it work in practice. And that is not because sbuild is
the best tool around, but because so many other tools integrate with
sbuild that it has effectively become a really complex API that I don't
want to reimplement. I tried plugging in podman and it just wouldn't
fit, so I'll continue looking for other options despite everyone else
telling me that this is a bad idea. Maintaining a container runtime is
hard because maintaining a code base of a hundred thousand lines is
hard. What if you merely needed a few thousand? Thus far, unschroot has
0.3 thousand lines (plus libraries). This also hints that podman is
solving a lot of problems, just not the ones we face.

And then, given how much we disagreed about container runtimes: I think
the end goal is not building in a container, but building inside a KVM
guest, as that provides far better isolation between guest and host. One
step at a time.

Helmut
