Bug#1078205: systemd: can't start polkitd in a podman container without CAP_SYS_ADMIN

Reinhard Tartler Sat, 10 Aug 2024 08:30:15 -0700

On 2024-08-09 04:52, Simon McVittie wrote:

On Thu, 08 Aug 2024 at 11:12:07 -0400, Reinhard Tartler wrote:

In short, it seems to me if you are running a workload that requires
CAP_SYS_ADMIN, then it is appropriate to pass that argument to podman.
It is clearly much better than using --privileged
(cf. [2] https://www.redhat.com/sysadmin/podman-inside-container)


Do you think this would be reasonably accurate wording to put in
autopkgtest-virt-podman(1)?

    Note that full functionality of systemd(1) as an init system requires
    the container to have CAP_SYS_ADMIN, which might allow code that
    runs as root in the container to compromise processes outside the
    container that are running under the same uid as
    autopkgtest-virt-podman. If this is consistent with your security
    model, it can be enabled by passing the --cap-add=CAP_SYS_ADMIN option
    to podman-run(1):

        autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN


Or is "might allow" too strong, or too weak?

I personally find that wording a bit too strong. How about something like this:

    Note that full functionality of systemd(1) as an init system requires
    the container to have CAP_SYS_ADMIN. This means that software
    like systemd that runs as root in the container can issue syscalls
    (such as mount(2), seccomp(2), etc.; see capabilities(7) for details)
    to set up its own sandboxing features. However, this also introduces
    an additional attack surface in the kernel if malicious code tries
    to escape the container sandbox. If this is consistent with your
    security model, it can be enabled by passing the
    --cap-add=CAP_SYS_ADMIN option to podman-run(1):

        autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN

(It might also be appropriate to add a shorthand form for that, to avoid
needing to use the "pass arbitrary options to podman-run" mechanism,
but that would need some more design to choose a suitable name for
that option. --trust-root-in-testbed, perhaps, if my understanding of
the impact of CAP_SYS_ADMIN is correct.)
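
As a rough illustration of what such a shorthand could expand to, here is a minimal Python sketch. It assumes that autopkgtest-virt-podman forwards everything after a second `--` to podman-run, as in the example command above. The function name `build_podman_virt_args` and the `trust_root` flag are hypothetical and are not part of autopkgtest.

```python
def build_podman_virt_args(image, trust_root=False, extra_run_args=()):
    """Build the virt-server part of an autopkgtest command line.

    Hypothetical helper: 'trust_root' stands in for a shorthand option
    (such as the suggested --trust-root-in-testbed) and simply expands
    to the podman-run option --cap-add=CAP_SYS_ADMIN.
    """
    args = ["podman", "--init", image]
    run_args = list(extra_run_args)
    cap_option = "--cap-add=CAP_SYS_ADMIN"
    if trust_root and cap_option not in run_args:
        run_args.append(cap_option)
    if run_args:
        # Everything after this separator is assumed to go to podman-run.
        args.append("--")
        args.extend(run_args)
    return args
```

With `trust_root=True` the result ends in `-- --cap-add=CAP_SYS_ADMIN`, which is the same thing the long form spells out by hand.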

I'd love to see such a shortcut, but it is not obvious to me how to name
it. Your suggestion seems too strong to me, because there are typically
still other security features in play, such as seccomp, selinux or apparmor.

Thank you so much for forwarding this question to upstream at
https://github.com/containers/podman/discussions/23558. Hopefully someone
like Dan can form an answer and opinion on this.

Since you assigned this bug to podman, may I ask what the ask is?
It's not clear how to improve the podman packaging in this context.


I initially reported this as a systemd bug, thinking that systemd's
expected behaviour would be to turn off features that require
CAP_SYS_ADMIN when it isn't available (as it already does for some but
not all such features, as far as I can see), but Luca replied "this
is an issue in podman". I don't know specifically what basis he had
for that statement.


Yeah, I don't see this as an issue in podman. Again, if you need to run
a workload that requires this capability, podman offers a knob that allows
you to do so. This isn't defeating security per se; it always depends
on the specific situation and use case.

Perhaps he was expecting that `podman run --systemd=true` (which is the
default) would detect that we're running /sbin/init in the container,
and automatically grant access to CAP_SYS_ADMIN? But I think that would
be inappropriate as an automatic thing if giving access to CAP_SYS_ADMIN
requires trusting the container payload.


Yeah, that's not what --systemd=true does. Here is the relevant part of
the reference documentation:

https://manpages.debian.org/unstable/podman/podman-run.1.en.html#--systemd=true_%7C_false_%7C_always

Running the container in systemd mode causes the following changes:

* Podman mounts tmpfs file systems on the following directories
   /run
   /run/lock
   /tmp
   /sys/fs/cgroup/systemd (on a cgroup v1 system)
   /var/lib/journal
* Podman sets the default stop signal to SIGRTMIN+3.
* Podman sets container_uuid environment variable in the container to
  the first 32 characters of the container ID.
* Podman does not mount virtual consoles (/dev/tty\d+) when running with
  --privileged.
* On cgroup v2, /sys/fs/cgroup is mounted writeable.

This allows systemd to run in a confined container without any modifications.
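
To see from inside a container whether the tmpfs mounts listed above are actually in place, something like the following Python sketch could be used. It only parses text in the /proc/mounts format. The helper names are made up for illustration, and the cgroup v1 path is left out of the expected list because it only applies on cgroup v1 systems.

```python
# Directories the podman-run documentation lists for systemd mode
# (the cgroup v1 entry is omitted on purpose, see the note above).
SYSTEMD_MODE_TMPFS = ("/run", "/run/lock", "/tmp", "/var/lib/journal")


def tmpfs_mount_points(mounts_text):
    """Return the set of mount points whose filesystem type is tmpfs."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, ...
        if len(fields) >= 3 and fields[2] == "tmpfs":
            points.add(fields[1])
    return points


def missing_systemd_tmpfs(mounts_text, expected=SYSTEMD_MODE_TMPFS):
    """Return the expected directories that are not mounted as tmpfs."""
    mounted = tmpfs_mount_points(mounts_text)
    return [path for path in expected if path not in mounted]
```

Inside the container, passing the contents of /proc/self/mounts to `missing_systemd_tmpfs()` should return an empty list if systemd mode took effect as documented.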

Re-reading through https://github.com/systemd/systemd/issues/29860
clarifies that systemd has a number of additional security hardening
features, such as `DynamicUser=`, but also things like `PrivateDevices=`,
`ProtectHome=`, `ProtectSystem=`, `MountFlags=`, `PrivateTmp=`,
`ReadWriteDirectories=`, `ReadOnlyDirectories=`, and
`InaccessibleDirectories=`.

It occurs to me that systemd is designed as a privileged process that aims
to provide sandboxing features primarily targeted at non-privileged
processes, or at least with reduced privileges. For this, it does need to
execute a number of syscalls, such as mount(2), setns(2), seccomp(2) and
many others, for which it needs CAP_SYS_ADMIN.
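
For anyone who wants to check whether a process actually has CAP_SYS_ADMIN, the effective capability set is exposed as a hex bitmask on the CapEff line of /proc/PID/status (see proc(5) and capabilities(7)). Below is a minimal Python sketch, with helper names of my own choosing, that decodes that mask. CAP_SYS_ADMIN is bit 21 in <linux/capability.h>.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/PID/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask without a 0x prefix.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_capability(cap_mask, cap_bit=CAP_SYS_ADMIN):
    """Return True if the given capability bit is set in the mask."""
    return bool((cap_mask >> cap_bit) & 1)
```

Running `has_capability(parse_cap_eff(open("/proc/1/status").read()))` inside the container would tell you whether PID 1 there was started with the capability.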

If nothing is going to be done about this in systemd, and nothing can be
done about it in podman, then it'll probably have to end up as a
documentation improvement in autopkgtest-virt-podman(1).


I tend to agree. I personally would be comfortable running containers
that have systemd inside with CAP_SYS_ADMIN because that is closer to
how systemd runs on a real system. Also, podman provides other additional
security features, such as seccomp and apparmor/selinux.

-rt
