On Wed, 25.02.15 00:05, Cyrill Gorcunov ([email protected]) wrote:

> Hi all! I would really appreciate if someone enlighten me if there is some
> simple solution for the problem we met in OpenVZ: modern containers are
> mostly systemd based so that once it is started up the systemd daemon
> mounts own instance of the systemd cgroup (if previously has not been
> pre-mounted by container startup tools or whatever). To make a strict
> isolation of nested systemd cgroup (by "nested" I mean systemd cgroup
> instance mounted inside container) we've patched the kernel so that
> container's systemd obtains own instance of cgroup non-intersected anyhow
> with one present on a host system.
>
> And we would really love to get rid of this kind of kernel's hack but be
> able to isolate nested systemd with own cgroup instance using solely
> userspace tools. Is there some way to reach this?

Not really. cgroupfs doesn't really allow that. First of all, the root
cgroup has a different set of attributes than child cgroups, hence you
cannot mount an arbitrary child as the root cgroup and assume it works.
But even worse, /proc/$PID/cgroup actually contains the full cgroup path,
and hence mounting only a subtree would break the references from that
file.

systemd-nspawn nowadays mounts all hierarchies into the container, but
mounts all controller hierarchies read-only, and within the name=systemd
hierarchy mounts everything read-only except the subtree the container is
allowed to manage. That way only the cgroup tree the container needs
access to is writable to it.

That solution however does not hide the cgroup tree. A process running
inside the container can still go and explore the tree and its
attributes. However, all other groups will appear empty to it, since
processes not in the container's PID namespace will be suppressed when
reading the member process list.

There have been proposals on LKML to add cgroup namespaces, but I have no
idea where that went.

LXC created a FUSE emulation of /proc and /sys, called lxcfs, to solve
this problem. Quite honestly I find this a pretty crazy idea however.

> If I understand correctly we can provide separate slice to container's
> systemd leaving the rest of host cgroup in ro mode, right?

Yes.

> If so maybe there a way to hide host cgroup completely from
> container so it would see only own cgroup in sysfs?

I don't see how this could work. I mean, you could overmount all other
cgroup siblings with empty directories in the container, but that is not
really scalable, nor compatible with cgroups being added or removed later
on...

Lennart

--
Lennart Poettering, Red Hat
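
To illustrate the /proc/$PID/cgroup point above: each line of that file
has the form hierarchy-ID:controller-list:path, where the path is given
relative to the root of the hierarchy. The Python sketch below parses
such lines and shows why the recorded path would stop resolving if only a
subtree were mounted. The sample contents, the machine-ct101.scope name
and both helper functions are hypothetical, made up for illustration;
they are not taken from OpenVZ or systemd-nspawn.

```python
# Hypothetical sample of what a process inside a container might read
# from /proc/self/cgroup; the group names are invented for this sketch.
SAMPLE = """\
4:cpu,cpuacct:/machine.slice/machine-ct101.scope
1:name=systemd:/machine.slice/machine-ct101.scope/system.slice/sshd.service
"""

# Hypothetical subtree the container is allowed to manage.
CONTAINER_ROOT = "/machine.slice/machine-ct101.scope"


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice, so a ':' inside the path is preserved.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy_id": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def path_relative_to(path, subtree_root):
    """Return path as seen from subtree_root, or None if it lies outside."""
    root = subtree_root.rstrip("/")
    if root == "":
        return path
    if path == root:
        return "/"
    if path.startswith(root + "/"):
        return path[len(root):]
    return None


if __name__ == "__main__":
    for entry in parse_proc_cgroup(SAMPLE):
        recorded = entry["path"]
        seen = path_relative_to(recorded, CONTAINER_ROOT)
        # If only CONTAINER_ROOT were mounted, 'recorded' would not exist
        # below the mount point; only 'seen' would.
        print(entry["controllers"], recorded, "->", seen)
```

The file always reports the full path from the hierarchy root, so a
container that had only its own subtree mounted would have to know its
prefix (CONTAINER_ROOT here) and strip it itself in order to find its own
groups.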
