On 2024-08-09 04:52, Simon McVittie wrote:
On Thu, 08 Aug 2024 at 11:12:07 -0400, Reinhard Tartler wrote:
In short, it seems to me that if you are running a workload that requires
CAP_SYS_ADMIN, then it is appropriate to pass that argument to podman. It is
clearly much better than using --privileged
(cf. https://www.redhat.com/sysadmin/podman-inside-container).

Do you think this would be reasonably accurate wording to put in
autopkgtest-virt-podman(1)?

    Note that full functionality of systemd(1) as an init system requires
    the container to have CAP_SYS_ADMIN, which might allow code that
    runs as root in the container to compromise processes outside the
    container that are running under the same uid as
    autopkgtest-virt-podman. If this is consistent with your security
    model, it can be enabled by passing the --cap-add=CAP_SYS_ADMIN option
    to podman-run(1):

autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN

Or is "might allow" too strong, or too weak?

I personally find that wording a bit too strong. How about something like this:

    Note that full functionality of systemd(1) as an init system requires
    the container to have CAP_SYS_ADMIN. This means that software like
    systemd that runs as root in the container can issue syscalls (such as
    mount(2) and seccomp(2); see capabilities(7) for details) to set up its
    own sandboxing features. However, this also exposes additional attack
    surface in the kernel if malicious code tries to escape the container
    sandbox. If this is consistent with your security model, it can be
    enabled by passing the --cap-add=CAP_SYS_ADMIN option to podman-run(1):

autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN


(It might also be appropriate to add a shorthand form for that, to avoid
needing to use the "pass arbitrary options to podman-run" mechanism,
but that would need some more design to choose a suitable name for
that option. --trust-root-in-testbed, perhaps, if my understanding of
the impact of CAP_SYS_ADMIN is correct.)

I'd love to see such a shortcut, but it is not obvious to me how to name it.
Your suggestion seems too strong to me, because there are typically still
other security features in play, such as seccomp, SELinux or AppArmor.

Thank you so much for forwarding this question to upstream at
https://github.com/containers/podman/discussions/23558. Hopefully someone
like Dan can form an answer and opinion on this.

Since you assigned this bug to podman, may I ask what the ask is?
It's not clear how to improve the podman packaging in this context.

I initially reported this as a systemd bug, thinking that systemd's
expected behaviour would be to turn off features that require
CAP_SYS_ADMIN when it isn't available (as it already does for some but
not all such features, as far as I can see), but Luca replied "this
is an issue in podman". I don't know specifically what basis he had
for that statement.

Yeah, I don't see this as an issue in podman. Again, if you need to run
a workload that requires this capability, podman offers a knob that allows
you to do so. This isn't defeating security per se; it always depends
on the specific situation and use case.

Perhaps he was expecting that `podman run --systemd=true` (which is the
default) would detect that we're running /sbin/init in the container,
and automatically grant access to CAP_SYS_ADMIN? But I think that would
be inappropriate as an automatic thing if giving access to CAP_SYS_ADMIN
requires trusting the container payload.

Yeah, that's not what --systemd=true does. Here is the relevant part of
the reference documentation:

https://manpages.debian.org/unstable/podman/podman-run.1.en.html#--systemd=true_%7C_false_%7C_always

Running the container in systemd mode causes the following changes:

* Podman mounts tmpfs file systems on the following directories:
   /run
   /run/lock
   /tmp
   /sys/fs/cgroup/systemd (on a cgroup v1 system)
   /var/lib/journal
* Podman sets the default stop signal to SIGRTMIN+3.
* Podman sets container_uuid environment variable in the container to the
  first 32 characters of the container ID.
* Podman does not mount virtual consoles (/dev/tty\d+) when running with
  --privileged.
* On cgroup v2, /sys/fs/cgroup is mounted writeable.

This allows systemd to run in a confined container without any modifications.
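
To see whether these mounts are in place from inside a running container, one could look at the mount table. The following is only a minimal sketch: is_tmpfs_mount is a name invented for this example, and it assumes the usual /proc/mounts layout (device, mount point, file system type, options). It checks a few of the directories listed above and does not account for a later mount stacked over an earlier one.

```sh
#!/bin/sh
# Hypothetical helper: succeed if mount point $1 is listed with file system
# type "tmpfs" in the mount table $2 (defaults to /proc/mounts).
is_tmpfs_mount() {
    awk -v mp="$1" '
        $2 == mp && $3 == "tmpfs" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Report on some of the directories from the list above (Linux only).
if [ -r /proc/mounts ]; then
    for dir in /run /run/lock /tmp /var/lib/journal; do
        if is_tmpfs_mount "$dir"; then
            echo "$dir: tmpfs"
        else
            echo "$dir: not a tmpfs mount point"
        fi
    done
fi
```

Run inside the container as any user; the optional second argument makes it possible to point the helper at a saved copy of the mount table instead.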

Re-reading https://github.com/systemd/systemd/issues/29860 clarifies that
systemd has a number of additional security hardening features, such as
`DynamicUser=`, but also things like `PrivateDevices=`, `ProtectHome=`,
`ProtectSystem=`, `MountFlags=`, `PrivateTmp=`, `ReadWriteDirectories=`,
`ReadOnlyDirectories=`, and `InaccessibleDirectories=`.

It occurs to me that systemd is designed as a privileged process that aims
to provide sandboxing features primarily for non-privileged processes, or
at least for processes with reduced privileges. For this, it needs to
execute a number of syscalls, such as mount(2), setns(2), seccomp(2) and
many others, for which it needs CAP_SYS_ADMIN.
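
As a rough illustration (my own sketch, not taken from systemd or podman): whether a process in the container actually holds CAP_SYS_ADMIN can be read from the CapEff bitmask in /proc/self/status. CAP_SYS_ADMIN is capability number 21, so the check is a single bit test. The helper name has_cap_sys_admin is made up for this example.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the hexadecimal capability mask given as
# $1 has bit 21 (CAP_SYS_ADMIN) set.
has_cap_sys_admin() {
    [ $(( (0x$1 >> 21) & 1 )) -eq 1 ]
}

# Report on the effective capability set of this shell, if the kernel
# exposes it (Linux only). The redirect is opened by the shell itself, so
# /proc/self refers to the shell process.
if [ -r /proc/self/status ]; then
    capeff=0
    while read -r key value; do
        if [ "$key" = "CapEff:" ]; then
            capeff=$value
        fi
    done < /proc/self/status
    if has_cap_sys_admin "$capeff"; then
        echo "CAP_SYS_ADMIN is in the effective set"
    else
        echo "CAP_SYS_ADMIN is not in the effective set"
    fi
fi
```

Running this as root in a container started with and without --cap-add=CAP_SYS_ADMIN should show the difference, assuming nothing else in the setup changes the capability set.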

If nothing is going to be done about this in systemd, and nothing can be
done about it in podman, then it'll probably have to end up as a
documentation improvement in autopkgtest-virt-podman(1).

I tend to agree. I personally would be comfortable running containers
that have systemd inside with CAP_SYS_ADMIN because that is closer to
how systemd runs on a real system. Also, podman provides other additional
security features, such as seccomp and apparmor/selinux.

-rt
