Re: [systemd-devel] [Q] About supporting nested systemd daemon

Lennart Poettering Wed, 25 Feb 2015 09:48:38 -0800

On Wed, 25.02.15 00:05, Cyrill Gorcunov ([email protected]) wrote:

> Hi all! I would really appreciate if someone enlighten me if there is some
> simple solution for the problem we met in OpenVZ: modern containers are
> mostly systemd based so that once it is started up the systemd daemon mounts
> own instance of the systemd cgroup (if previously has not been pre-mounted
> by container startup tools or whatever). To make a strict isolation of nested
> systemd cgroup (by "nested" I mean systemd cgroup instance mounted inside
> container) we've patched the kernel so that container's systemd obtains own
> instance of cgroup non-intersected anyhow with one present on a host system.
> 
> And we would really love to get rid of this kind of kernel's hack but be able
> to isolate nested systemd with own cgroup instance using solely userspace
> tools. Is there some way to reach this?


Not really. cgroupfs doesn't really allow that. First of all the root
cgroup has a different set of attributes than child cgroups, hence you
cannot mount an arbitrary child to the root cgroup and assume it
works. But even worse, /proc/$PID/cgroup actually contains the full
cgroup path, and hence mounting only a subtree would break the
references from that file.
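
To illustrate the second point, here is a minimal Python sketch that
parses the contents of /proc/$PID/cgroup. It assumes the cgroup v1 line
format "hierarchy-ID:controller-list:path"; the function names and the
sample paths are made up for illustration only.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup contents into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path". Split at
        # most twice so that colons inside the path are preserved.
        hier_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hier_id), names, path))
    return entries


def visible_under(path, subtree):
    """Return True if the absolute cgroup `path` lies inside `subtree`."""
    subtree = subtree.rstrip("/")
    return path == subtree or path.startswith(subtree + "/")


# Hypothetical sample; the scope name is an assumption, not real output.
SAMPLE = (
    "4:cpu,cpuacct:/machine.slice/machine-foo.scope\n"
    "1:name=systemd:/machine.slice/machine-foo.scope/init.scope\n"
)

for hier_id, names, path in parse_proc_cgroup(SAMPLE):
    print(hier_id, names, path)
```

Every path is reported relative to the root of its hierarchy, so if
only /machine.slice/machine-foo.scope were mounted as the root, a
reader of this file inside the container would have no directory that
matches these paths.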

systemd-nspawn nowadays mounts all hierarchies into the container, but
mounts all controller hierarchies read-only, and of the name=systemd
hierarchy mounts everything read-only, except the subtree the
container is allowed to manage. That way only the cgroup tree the
container needs access to is writable to it. That solution however
does not hide the cgroup tree. A process running inside the container
can still go and explore the tree and its attributes. However, all
other groups will appear empty to it, since processes not in the
container's PID namespace will be suppressed when reading the member
process list.
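
As a rough way to inspect such a layout from inside a container, the
following Python sketch reads text in the /proc/mounts format and
reports which cgroup mount points are read-only. The function name and
the sample lines are assumptions made for illustration; they are not
captured from a real systemd-nspawn container.

```python
def cgroup_mount_modes(mounts_text):
    """Map each cgroup (v1) mount point to "ro" or "rw"."""
    modes = {}
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        mount_point, options = fields[1], fields[3].split(",")
        # A later mount on the same mount point replaces the earlier entry.
        modes[mount_point] = "ro" if "ro" in options else "rw"
    return modes


# Hypothetical sample: read-only hierarchies plus one writable subtree.
SAMPLE_MOUNTS = """\
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,name=systemd 0 0
cgroup /sys/fs/cgroup/systemd/machine.slice/machine-foo.scope cgroup rw,nosuid,nodev,noexec,name=systemd 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
"""

for mount_point, mode in cgroup_mount_modes(SAMPLE_MOUNTS).items():
    print(mode, mount_point)
```

On a live system the input would come from reading /proc/self/mounts
instead of the sample string.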

There have been proposals on LKML to add cgroup namespaces, but I have
no idea where that went.

LXC created a FUSE emulation of /proc and /sys, called lxcfs, to solve
this problem. Quite honestly, however, I find this a pretty crazy idea.

> If I understand correctly we can provide separate slice to container's
> systemd leaving the rest of host cgroup in ro mode, right?

Yes.

> If so maybe there a way to hide host cgroup completely from
> container so it would see only own cgroup in sysfs?

I don't see how this could work. I mean, you could overmount all other
cgroup siblings with empty directories in the container, but that is not
really scalable, nor compatible with cgroups being added or removed
later on...
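
To make the scaling problem concrete, here is a small Python sketch
that computes which directories would have to be overmounted in order
to hide everything except one subtree. It only walks an ordinary
directory tree and performs no mounts; the function name and its
parameters are hypothetical.

```python
import os


def siblings_to_hide(root, allowed):
    """List directories under `root` that are neither ancestors of
    `allowed` nor inside it, i.e. the ones that would need overmounting."""
    hidden = []
    current = root
    rel = os.path.relpath(allowed, root)
    parts = [p for p in rel.split(os.sep) if p != "."]
    for part in parts:
        # At every level on the way down, all siblings of the next
        # path component have to be hidden.
        for name in sorted(os.listdir(current)):
            full = os.path.join(current, name)
            if name != part and os.path.isdir(full):
                hidden.append(full)
        current = os.path.join(current, part)
    return hidden
```

The result is a snapshot: a group created on the host after the list
was computed is not covered and stays visible until the list is
recomputed and another overmount is added.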

Lennart

-- 
Lennart Poettering, Red Hat
_______________________________________________
systemd-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
