Control: tags -1 + confirmed

On Sun, Apr 28, 2024 at 10:59:14PM +0200, Johannes Schauer Marin Rodrigues wrote:
> Quoting Aurelien Jarno (2024-04-28 15:57:29)
> > When running sbuild in unshare chroot mode, it is not possible to write to
> > /dev/stdout:
> > 
> > | echo test > /dev/stdout
> > | sh: 1: cannot create /dev/stdout: Permission denied
> > 
> > This is the reason of the FTBFS of at least clisp and supervisor when using
> > the unshare chroot mode of sbuild.

Jochen asked me to look into this. Let me write down what I have for the
benefit of the next person dumping brain cells into it.

I used bookworm's sbuild with the supervisor package and the failure
readily reproduced. I added an execute_before_dh_auto_test target with
a few diagnostics:

| ls -la /dev/stdout
| lrwxrwxrwx 1 root root 15 Apr 30 07:36 /dev/stdout -> /proc/self/fd/1
| ls -la /proc/self/fd
| total 0
| dr-x------ 2 helmut sbuild  0 Apr 30 07:36 .
| dr-xr-xr-x 9 helmut sbuild  0 Apr 30 07:36 ..
| lrwx------ 1 helmut sbuild 64 Apr 30 07:36 0 -> /dev/null
| l-wx------ 1 helmut sbuild 64 Apr 30 07:36 1 -> pipe:[135566170]
| l-wx------ 1 helmut sbuild 64 Apr 30 07:36 2 -> pipe:[135566170]
| lr-x------ 1 helmut sbuild 64 Apr 30 07:36 3 -> /proc/123/fd
| echo hello > /proc/self/fd/1
| /bin/sh: 1: cannot create /proc/self/fd/1: Permission denied

I also added --anything-failed-commands=%SBUILD_SHELL and there things
look different.

| # ls -la /proc/self/fd/1
| l-wx------ 1 root root 64 Apr 30 07:44 /proc/self/fd/1 -> /dev/tty
| # runuser -u helmut bash
| $ ls -la /proc/self/fd/1
| l-wx------ 1 helmut sbuild 64 Apr 30 07:48 /proc/self/fd/1 -> /dev/tty

Running supervisor's test suite succeeds here.

Quite certainly, the cause is connected to that pipe. The pipe in
question is connecting the build log to a process that filters the build
log and replaces PKGBUILDDIR and stuff. As far as I understand it, the
crucial bit is that this process runs outside of the namespace.

To confirm this hypothesis, I tried the following override:

| override_dh_auto_test:
|       dh_auto_test | cat

In essence, I am placing another process (cat) inside the namespace such
that the stdout pipe of the test resides fully inside the namespace and
cat is responsible for writing to the pipe outside without going via
/proc/self/fd. With this modification, the build works again.

> This works in podman. So somehow it's possible to connect /dev/stdout in a way
> which preserves its intended functionality. Probably it would be useful to find
> out how podman does this. For what its worth, mmdebstrap itself suffers from
> the same problem, so whatever fix is used in sbuild should probably also be
> added to mmdebstrap.

This works neither in podman
https://github.com/containers/podman/issues/16870 nor in docker
https://github.com/moby/moby/issues/31243. It only sometimes works,
namely when you run it interactively and stdout thus points to a tty
device. As soon as stdout is a pipe, it fails.

This is actually something I researched more deeply a while ago without
success. I was trying to open a regular file in the initial namespace,
inherit the open file across unshare into a user and mount namespace and
then open /proc/self/fd/N. Likewise, I get -EACCES there in the very
same way. Some part of permission management prevents this kind of
(intentional) leakage of file descriptors, but I cannot tell which or
why.

The lesson learned seems to be that when you run a container workload,
your stdout or stderr should either connect to a tty or to a process
that lives inside your namespace (not sure which of them).

It also seems possible to change permission of those pipes
https://github.com/containers/conmon/pull/112 but I do not understand
what it means to do so and whether that technically is a good idea. If
you

    chmod(0666, *STDOUT);

right before unsharing in Sbuild/Utility.pm, the supervisor test also
passes, but this can also have undesired effects if stdout is connected
to a regular file. So we really should check that STDOUT is a pipe
before doing so. There is protection in the sense that /proc/self/fd by
default is mode 0500. I also note that POSIX allows fchmod to fail with
EINVAL when it is performed on a pipe, so doing this very much is a
Linux-ism (but namespaces already are).

To see whether stdout is a pipe, we may fstat it and figure out whether
its st_mode has S_IFIFO. In perl, that's:

    use Fcntl ':mode';
    ... if (((stat(*STDOUT))[2] & S_IFMT) == S_IFIFO);
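
The chmod and the pipe check can be combined into one guarded step. Here
is a minimal sketch of that logic, written in Python rather than Perl
purely for illustration. relax_pipe_mode is a hypothetical helper name
and not something that exists in sbuild, and the sketch assumes Linux
behaviour for fchmod on a pipe:

```python
import os
import stat


def relax_pipe_mode(fd: int, mode: int = 0o666) -> bool:
    """fchmod fd to mode, but only if fd refers to a pipe.

    Returns True if the mode was changed, False otherwise.
    Hypothetical helper for illustration only.
    """
    if not stat.S_ISFIFO(os.fstat(fd).st_mode):
        # Regular file, tty, socket, ...: leave its permissions alone.
        return False
    try:
        os.fchmod(fd, mode)
    except OSError:
        # fchmod on a pipe is not portable; treat failure as "unchanged".
        return False
    return True
```

Calling relax_pipe_mode(1) right before unsharing would correspond to
the chmod above, while leaving a stdout that was redirected to a regular
file untouched.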

Going deeper with research, I think this is actually not a namespace
problem. https://groups.google.com/g/fa.linux.kernel/c/WVFgFngkJZw
indicates a very similar problem with doing setuid. We can emulate this
locally and reproduce the failure:

    unshare -U --map-auto -S 0 -G 0 sh -c \
        '/sbin/runuser -u daemon -- sh -c ": >/proc/self/fd/1" | cat'

noting that the use of unshare here is purely added for the benefit of
running the test code unprivileged. You can also just paste the shell
part into a regular root shell in the initial namespace and have it
exhibit -EACCES in the very same way.
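
For contrast, here is a minimal Python sketch of the case that works,
assuming Linux with /proc mounted. Opening /proc/self/fd/N is a fresh
open() on the pipe, so it goes through a permission check instead of
merely duplicating the descriptor. When the same user created the pipe
and no setuid happens in between, that check passes. reopen_write_end is
a hypothetical name used only for this illustration:

```python
import os


def reopen_write_end() -> bytes:
    """Create a pipe, reopen its write end via /proc, and write through it."""
    r, w = os.pipe()
    try:
        # This is a new open() on the pipe, not a dup() of fd w, so the
        # kernel checks permissions at this point.
        with open(f"/proc/self/fd/{w}", "wb", buffering=0) as again:
            again.write(b"hello\n")
        # The data written through the reopened end arrives on the read end.
        return os.read(r, 16)
    finally:
        os.close(r)
        os.close(w)
```

Doing the same after switching to a different user, as runuser does
above, is the case where the open fails.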

It is probably worth noting that the end of a pipe bears quite some
resemblance to a file on Linux. It has owner, group, permission,
timestamps and stuff. You can inspect using

    unshare -U --map-auto -S 0 -G 0 sh -c \
        '/sbin/runuser -u daemon -- python3 -c "import os;print(os.fstat(1))" | cat'

and also drop the "| cat" for comparison.
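
The raw os.fstat output is a bit hard to read, so here is a small sketch
that decodes the interesting fields. describe_fd is a hypothetical
helper name and the selection of fields is my own choice:

```python
import os
import stat


def describe_fd(fd: int) -> dict:
    """Return the file-like metadata of an open file descriptor."""
    st = os.fstat(fd)
    return {
        "is_fifo": stat.S_ISFIFO(st.st_mode),      # True for either end of a pipe
        "mode": oct(stat.S_IMODE(st.st_mode)),     # permission bits only
        "uid": st.st_uid,                          # owner of the pipe end
        "gid": st.st_gid,                          # group of the pipe end
    }


if __name__ == "__main__":
    # Run with and without "| cat" to compare a pipe against a tty.
    print(describe_fd(1))
```

Running this in place of the python3 one-liner above shows which user
owns the write end that the process is trying to reopen.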

So given that we can only access the pipe via its fd number or
/proc/PID/fd/N and that /proc/PID/fd is mode 0500, the chmod is probably
safe and the alternative would be using fchown to assign the write end
to the build user.

I hope this helps in constructing a solution and also is an enlightening
read on what goes on behind the curtain.

Helmut
