Hi Simon,

Thanks for taking the time to do another extensive writeup. Much
appreciated.

On Wed, Jun 26, 2024 at 06:11:09PM +0100, Simon McVittie wrote:
> On Tue, 25 Jun 2024 at 18:55:45 +0200, Helmut Grohne wrote:
> > The main difference to how everyone else does this is that in a typical
> > sbuild interaction it will create a new user namespace for every single
> > command run as part of the session. sbuild issues tens of commands
> > before launching dpkg-buildpackage and each of them creates new
> > namespaces in the Linux kernel (all of them using the same uid mappings,
> > performing the same bind mounts and so on). The most common way to think
> > of containers is different: You create those namespaces once and reuse
> > the same namespace kernel objects for multiple commands part of the same
> > session (e.g. installation of build dependencies and dpkg-buildpackage).
> 
> Yes. My concern here is that there might be non-obvious reasons why
> everyone else is doing this the other way, which could lead to behavioural
> differences between unschroot and all the others that will come back to
> bite us later.

I do not share this particular concern (though I do share others of
yours). The risk of behavioural differences is fairly low, because we do
not expect any non-filesystem state to carry over from one command to
the next. On the contrary, using a pid namespace for each command
ensures reliable process cleanup, so no background processes can
accidentally stick around.

I am concerned about behavioural differences arising from the
reimplementation-from-first-principles aspect though. Jochen and
Aurelien will know more here, but I think we had a fair number of FTBFS
due to such differences. None of them was due to the architecture of
creating new namespaces for each command; most were due to not getting
containers right in general, and some were broken packages, e.g. ones
skipping tests when they detect schroot.

Also note that my not sharing your concern here does not imply that I
favour sticking to that architecture. I said elsewhere that I see
benefits in changing it for other reasons. At this point I increasingly
see this as a non-boolean question: there is a spectrum between "create
namespaces once and use them for the entire session" and "create new
namespaces for each command", and I am starting to believe that what
would be best for sbuild lies somewhere in between.

> For whole-system containers running an OS image from init upwards,
> or for virtual machines, using ssh as the IPC mechanism seems
> pragmatic. Recent versions of systemd can even be given a ssh public
> key via the systemd.system-credentials(7) mechanism (e.g. on the kernel
> command line) to set it up to be accepted for root logins, which avoids
> needing to do this setup in cloud-init, autopkgtest's setup-testbed,
> or similar.

Another excursion: systemd goes beyond this and also exposes the ssh
port via AF_VSOCK (in the case of VMs) or via a unix domain socket on
the outside (in the case of containers) to make safe discovery of the
ssh access easier.

> For "application" containers like the ones you would presumably want
> to be using for sbuild, presumably something non-ssh is desirable.

I partially concur, but this goes into the larger story I hinted at in
my initial mail. If we move beyond containers and look into building
inside a VM (e.g. sbuild-qemu), we are in a difficult spot: we need
e.g. systemd for booting, but we may not want it in our build
environment. So in the long term, I think sbuild will have to
differentiate between three contexts:
 * The system it is being run on
 * The containment or virtualisation environment used to perform the
   build
 * The system where the build is being performed inside the containment
   or virtualisation environment

At present, sbuild does not distinguish the latter two and always treats
them as the same. When building inside a VM, we may eventually want to
create a chroot inside the VM to arrive at a minimal environment. The
same technique applies to system containers. When doing this, we
minimize the build environment and do not mind the extra ssh dependency
in the container or virtualisation environment. For now though, this is
all wishful thinking. As long as this distinction does not exist, we
pretty much want minimal application containers for building, as you
said.

> If you build an image by importing a tarball that you have built in
> whatever way you prefer, minimally something like this:
> 
>     $ cat > Dockerfile <<EOF
>     FROM scratch
>     ADD minbase.tar.gz /
>     EOF
>     $ podman build -f Dockerfile -t local-debian:sid .

I don't quite understand the need for a Dockerfile here. I suspect that
this is the obvious way that works reliably, but my impression was that
using podman import would be easier. I had success with this:

    $ mmdebstrap --format=tar --variant=apt unstable - \
        | podman import --change CMD=/bin/bash - local-debian/sid

> then you should be able to use localhost/local-debian:sid
> as a substitute for debian:sid in the examples given in
> autopkgtest-virt-podman(1), either using it as-is for testing:
> 
>     $ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

This did not work for me: autopkgtest failed to create a user account. I
suspect one of two reasons: either autopkgtest expects python3 to be
installed and it isn't, or it expects passwd to be installed and does
not install it when it is missing (passwd being non-essential).
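If that suspicion is right, a possible workaround (untested on my side,
so take it as an assumption) would be to include both packages in the
bootstrap already:

    $ mmdebstrap --format=tar --variant=apt --include=python3,passwd \
        unstable - \
        | podman import --change CMD=/bin/bash - local-debian/sid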

> or making an image that has been pre-prepared with some essentials like
> dpkg-source, and testing in that:
> 
>     $ autopkgtest-build-podman --image localhost/local-debian:sid
>     ...
>     Successfully tagged localhost/autopkgtest/localhost/local-debian:sid

Works for me.

>     $ autopkgtest hello*.dsc -- podman autopkgtest/localhost/local-debian:sid
>     (tests run)

Thank you very much. I got this working for application-container-based
testing, which provides a significant speedup compared to virt-qemu.

I am more interested in providing isolation-container though, as a
number of tests require that, and I currently tend to resort to
virt-qemu for it. Sure enough, adding --init=systemd to
autopkgtest-build-podman just works, and a system container can also be
used as an application container by autopkgtest (so there is no need to
build both), but running autopkgtest-virt-podman with --init also fails
here in non-obvious ways. It appears that user creation was successful,
but the user creation script is still printed in red.

We're now deep into debugging specific problems in the
autopkgtest/podman integration and this is probably getting off-topic
for d-devel. Is the evidence thus far sufficient for turning this part
of the discussion into a bug report against autopkgtest?

> Adding a mode for "start from this pre-prepared minbase tarball" to all
> of the autopkgtest-build-* tools (so that they don't all need to know
> how to run debootstrap/mmdebstrap from first principles, and then duplicate
> the necessary options to make it do the right thing), has been on my
> to-do list for literally years. Maybe one day I will get there.

From my point of view, this isn't actually necessary. I expect that many
people would be fine drawing images from a container registry. Those
stubborn people like me will happily go the extra mile.

> We could certainly also benefit from some syntactic sugar to make the
> automatic choice of an image name for localhost/* podman images nicer,
> with fewer repetitions of localhost/.

Let me pose a possibly stupid suggestion. Most of the time when people
interact with autopkgtest, they use a fairly limited set of backends and
backend options. Rather than making the options shorter, how about
introducing an aliasing mechanism? Say I could have some
~/.config/autopkgtest.conf, and whenever I run autopkgtest ... --
$BACKEND such that there is no autopkgtest-virt-$BACKEND, that
configuration file would be consulted and, if it assigns a value to that
name, the name would be expanded to the assigned value. Then I could
record my commonly used backends and options there and refer to them by
memorable names of my own choosing. Automatic choice of images makes
things more magic, which has its downsides as well.
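To make the idea concrete, something along these lines (the file name is
the one I made up above, the syntax is entirely hypothetical and only
meant to illustrate the shape of the feature):

    # ~/.config/autopkgtest.conf (hypothetical syntax)
    [virt-aliases]
    sidpodman = podman autopkgtest/localhost/local-debian:sid
    sidqemu = qemu /srv/images/sid.img

which would then allow:

    $ autopkgtest hello*.dsc -- sidpodman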

> podman is unlikely to provide you with a way to generate a minbase
> tarball without first creating or downloading some sort of container
> image in which you can run debootstrap or mmdebstrap, because you have
> to be able to start from somewhere. But you can run mmdebstrap unprivileged
> in unshare mode, so that's enough to get you that starting point.

I consider this part of the problem space fully solved.

Please allow another podman question (more people than just Simon will
know the answer). Every time I run a podman container (e.g. when I run
autopkgtest), my ~/.local/share/containers grows. I think autopkgtest
manages to clean up in the end, but e.g. podman run -it ... seems to
leave stuff behind. Such a growing directory is problematic for multiple
reasons, but I was also hoping that podman would use fuse-overlayfs +
tmpfs to run my containers instead of writing tons of stuff to my slow
disk. I hoped --image-volume=tmpfs would improve this, but it did not.
Of course, when I skip podman's image management and use --rootfs, I can
sidestep the problem by choosing a root location on a tmpfs, but that is
not how autopkgtest uses podman.
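In case it helps others hitting the same thing, the stock commands I
have been using to inspect and prune the accumulated state are the
following (whether they address the underlying growth is of course the
open question):

    # show what the storage is spent on
    $ podman system df
    # drop stopped containers and dangling images left behind
    $ podman container prune
    $ podman image prune

and passing --rm to podman run avoids leaving the stopped container
behind in the first place.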

> > We learned that sbuild --chroot-mode=unshare and unschroot spawn
> > a new set of namespaces for every command. What you point out as a
> > limitation also is a feature. Technically, it is a lie that the
> > namespaces are always constructed in the same way. During installation
> > of build depends the network namespace is not unshared while package
> > builds commonly use an unshared network namespace with no interfaces but
> > the loopback interface.
> 
> I don't think podman can do this within a single run. It might be feasible
> to do the setup (installing build-dependencies) with networking enabled;
> leave the root filesystem of that container intact; and reuse it as the
> root filesystem of the container in which the actual build runs, this time
> with --network=none?

Do I understand correctly that in this variant you intend to use podman
without its image management capabilities and instead just use --rootfs,
spawning two podman containers on the same --rootfs (one after the
other), where the first installs the build dependencies and the second
isolates the network for the build?
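If so, I imagine it would look roughly like the following. This is only
a sketch of my understanding: the paths and package name are made up,
and a real implementation would have to arrange for the source tree and
suitable apt configuration inside the rootfs.

    # one-time: unpack a rootfs somewhere (a tmpfs in my case)
    $ mmdebstrap --variant=apt unstable /tmp/build-rootfs
    # first container: networked, installs the build dependencies
    $ podman run --rootfs /tmp/build-rootfs \
        apt-get --yes build-dep hello   # assumes deb-src entries exist
    # second container: same rootfs, no network, runs the build
    $ podman run --network=none --rootfs /tmp/build-rootfs \
        sh -c 'cd /build/hello && dpkg-buildpackage -us -uc'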

> Or the "install build-dependencies" step (and other setup) could perhaps
> even be represented as a `podman build` (with a Dockerfile/Containerfile,
> FROM the image you had as your starting point), outputting a temporary
> container image, in which the actual dpkg-buildpackage step can be invoked
> by `podman run --network=none --rmi`?

In this case, we build a complete container image for the purpose of
building a package. This has interesting consequences. For one thing, we
often build the same package twice, so caching such an image for some
time is an obvious feature to look into.

If you go that way, you may as well use mmdebstrap to construct
containers with precisely your relevant build-dependencies on demand
(for every build). The mmdebstrap ... | podman import ... rune would
roughly work for that.
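A sketch of what I have in mind, again hedged: the --include list below
is a stand-in and would really have to be derived from the Build-Depends
of the package at hand, e.g. from its .dsc.

    $ mmdebstrap --format=tar --variant=apt \
        --include=build-essential,debhelper \
        unstable - \
      | podman import --change CMD=/bin/bash - local-debian/sid-build-hello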

Let me try to take a step back here. The podman model (and that of many
other runtimes) is that one session equates to one set of namespaces,
yet network isolation requires another set of namespaces. Your two
approaches cleverly sidestep this, either by running two containers on
the same directory hierarchy, or by constructing containers on demand
(in one set of namespaces) and running them (in another).

These approaches come with limitations. The first one requires
reinventing podman's image management by hand; in particular, it keeps
us from using overlays to avoid extraction, or from doing the extraction
on demand via e.g. squashfs. The second one requires writing the
container image to disk, which significantly degrades build performance.
In an ideal world, I think we want one user and mount namespace for the
entire session and then per-command pid and network namespaces as
needed. If we want to enable these use cases, then I fear podman is not
the tool of choice, as its feature set does not match these (idealized)
requirements. In other words, settling on podman limits which features
we can implement in sbuild, but it may still allow more than the status
quo, so it can still be an incremental improvement. The question then
becomes whether it is reasonable to skip the podman step and move to an
architecture that enables more of our use cases.
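To spell out what I mean by that ideal split, here is a very rough
sketch using nothing but util-linux tools; this is hand-waving to
illustrate the namespace layout, not something sbuild does today, and
the paths are made up:

    # long-lived session namespaces: one user and one mount namespace
    $ unshare --user --map-auto --mount --fork bash
    # (inside) perform the session-wide filesystem setup once, e.g.
    #   mount --bind /srv/chroots/sid /mnt
    # then run each command in fresh pid (and optionally net) namespaces
    # that reuse the outer user and mount namespace:
    #   unshare --pid --fork chroot /mnt apt-get --yes build-dep hello
    #   unshare --pid --net --fork chroot /mnt dpkg-buildpackage -us -uc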

And then the question becomes whether unschroot is that better
architecture or not, and whether the maintenance risk you correctly
identified is worth trading for the additional features we expect from
it.

Helmut
