On Thu, Jan 12, 2017 at 09:22:13AM -0500, David Miller wrote:
> From: Krister Johansen <k...@templeofstupid.com>
>
> > The use case for this change is to allow containerized processes to bind
> > to priviliged ports, but prevent them from ever being allowed to modify
> > their container's network configuration. The latter is accomplished by
> > ensuring that the network namespace is not a child of the user
> > namespace. This modification was needed to allow the container manager
> > to disable a namespace's priviliged port restrictions without exposing
> > control of the network namespace to processes in the user namespace.
>
> This is what CAP_NET_BIND_SERVICE is for, and why it is a separate
> network privilege, please use it.

It sounds like I may have done an inadequate job of explaining why I
took this approach instead of going the CAP_NET_BIND_SERVICE route.

In this scenario, the network namespace is created and configured
first. Then the containerized processes get placed into a separate user
namespace. This is so that the processes in the container, even if they
somehow manage to obtain extra privilege in the userns, can never modify
the network namespace.

The check in ns_capable() is looking at the privileges of the user
namespace that created the netns and its parents. Even if I were to
grant a process in the container CAP_NET_BIND_SERVICE, ns_capable()
wouldn't recognize that as being a valid privilege for the netns.

If I were to invert the order of operations and create the userns before
the netns, then the capability would be recognized. However, that also
means that any privilege escalation in the userns could let an attacker
modify the container's network configuration.

I'd much rather run the containers without privs, and without the userns
having rights to the netns, to mitigate the risk of an attacker being
able to alter the container's networking configuration.

-K
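
To make the ordering argument concrete, here is a rough Python model of the check. It is only a sketch: the class names (UserNS, NetNS, Cred) are made up for illustration, the loop is modeled on my reading of the kernel's cap_capable() walk, and LSMs, file capabilities, and everything else are ignored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model types; these are NOT kernel structures.
@dataclass(eq=False)
class UserNS:
    parent: Optional["UserNS"] = None
    owner_uid: int = 0  # euid of the process that created this userns

    @property
    def level(self) -> int:
        # Depth below the initial user namespace (which is level 0).
        return 0 if self.parent is None else self.parent.level + 1

@dataclass(eq=False)
class NetNS:
    user_ns: UserNS  # userns that was current when the netns was created

@dataclass(eq=False)
class Cred:
    user_ns: UserNS
    euid: int
    effective: frozenset = frozenset()

def ns_capable(cred: Cred, target: UserNS, cap: str) -> bool:
    """Simplified model: does cred hold cap with respect to target?"""
    ns = target
    while True:
        if ns is cred.user_ns:
            # Same userns: the capability must be in the effective set.
            return cap in cred.effective
        if ns.level <= cred.user_ns.level:
            # Target is not a descendant of the caller's userns.
            return False
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child userns holds all caps over it.
            return True
        ns = ns.parent

init_ns = UserNS()

# Order used in this thread: netns first (owned by the initial userns),
# then the container is placed in a separate child userns.
netns_first = NetNS(user_ns=init_ns)
container_ns = UserNS(parent=init_ns, owner_uid=1000)
proc = Cred(container_ns, euid=0,
            effective=frozenset({"CAP_NET_BIND_SERVICE", "CAP_NET_ADMIN"}))
bind_netns_first = ns_capable(proc, netns_first.user_ns, "CAP_NET_BIND_SERVICE")
admin_netns_first = ns_capable(proc, netns_first.user_ns, "CAP_NET_ADMIN")

# Inverted order: userns first, so the netns is owned by the container userns.
netns_second = NetNS(user_ns=container_ns)
bind_userns_first = ns_capable(proc, netns_second.user_ns, "CAP_NET_BIND_SERVICE")
admin_userns_first = ns_capable(proc, netns_second.user_ns, "CAP_NET_ADMIN")
```

In this model, with the netns created first, both checks fail for the containerized process no matter which capabilities it holds in its own userns. With the userns created first, CAP_NET_BIND_SERVICE is honored, but so is CAP_NET_ADMIN if the process ever obtains it, which is the exposure I am trying to avoid.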