On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the
>>> correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not
>> applicable."
>> To elaborate more on that:
>> netns is L2 isolation.
>> vrf is L3 isolation.
>> Whereas to containerize an application we need to punch
>> connectivity holes in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work and even
>> went as far as designing a new namespace for L4 isolation.
>> Unfortunately, hierarchical namespace abstractions don't work here.
>> To run an application inside a cgroup container that was not
>> written with containers in mind, we have to create the illusion
>> that it is running in a non-containerized environment.
>> In some cases we remember the port and container id in the
>> post-bind hook in a bpf map, and when some other task in a
>> different container tries to connect to a service, we need to know
>> where that service is running. It can be remote or local. Neither
>> the client nor the service needs to be written with containers in
>> mind; this sockaddr rewrite provides connectivity and load
>> balancing that you simply cannot get from hierarchical networking
>> primitives.
>
> I have to explain this a bit further...
> We also considered hacking these 'connectivity holes' into netns
> and/or vrf, but that would be a real layering violation, since a
> clean L2 or L3 abstraction would suddenly support something that
> breaks through the layers.
> Just as many consider ipvlan a bad hack for punching through the
> layers and connecting the L2 abstraction of netns at L3, this is
> not something the kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead, bpf programs at bind/connect time _help_ applications
> discover and connect to each other.
> All containers run in init_netns and there are no vrfs.
> After bind/connect, the normal fib/neighbor core networking logic
> works as it always does. The whole system is clean from a
> networking point of view.
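To make the mechanism described above concrete, here is a minimal
sketch of the two hooks, assuming the post-bind and connect attach
points this series adds (with their bpf_sock and bpf_sock_addr
contexts, and the return-1-to-allow convention). The map names,
key/value layouts, and the control-plane split are illustrative
assumptions, not taken from the patches.

/* A minimal sketch, not the actual patches: illustrates the
 * post-bind "remember where the service bound" step and the
 * connect-time sockaddr rewrite described above.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct svc_key {
	__u32 vip;	/* service IP the client asked for (network order) */
	__u32 port;	/* service port, network order as in user_port */
};

struct svc_backend {
	__u32 ip4;	/* address the service actually bound (network order) */
	__u32 port;	/* real port (network order) */
};

/* Populated from userspace by a control plane that joins the
 * post-bind records below with its container/service inventory.
 * (The "container id" from the cover letter lives in that
 * inventory; since cgroup-bpf programs attach per-cgroup, one map
 * per container cgroup is one way to keep the association.)
 */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

/* bound port (host order) -> bound address, written at bind time */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);
	__type(value, struct svc_backend);
} bound_ports SEC(".maps");

SEC("cgroup/post_bind4")
int remember_bind(struct bpf_sock *sk)
{
	__u32 port = sk->src_port;	/* host byte order in struct bpf_sock */
	struct svc_backend be = {
		.ip4  = sk->src_ip4,	/* already network byte order */
		.port = bpf_htons((__u16)sk->src_port),
	};

	bpf_map_update_elem(&bound_ports, &port, &be, BPF_ANY);
	return 1;	/* 1 = let the bind() complete */
}

SEC("cgroup/connect4")
int rewrite_connect(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		/* Transparent redirect: the app keeps its illusion of a
		 * flat network; the socket connects to the real backend.
		 */
		ctx->user_ip4  = be->ip4;
		ctx->user_port = be->port;
	}
	return 1;	/* 1 = let the connect() proceed */
}

char _license[] SEC("license") = "GPL";

Note that the rewrite costs one map lookup per bind()/connect()
call; after that the socket is an ordinary init_netns socket on the
normal fib/neighbor path.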
We apparently missed something when deploying ipvlan and one netns
per container/job.

Full access to 64K ports, no more ports being reserved/abused. If
one job needs more, no problem, just use more than one IP per netns.
It also works with UDP just fine.

Are you considering adding a hook later for sendmsg() (unconnected
socket or not), or do you want to use the existing one in
ip_finish_output(), adding per-packet overhead?

This notion of a 'clean l2, l3 abstraction' is very subjective. I
find netns isolation very clean and powerful, and it is already
there.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.
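For what it's worth on the sendmsg() question: the same
once-per-syscall rewrite can cover unconnected UDP without a
per-packet hook. Later kernels did add a cgroup attach point for
exactly this (BPF_CGROUP_UDP4_SENDMSG), reusing the bpf_sock_addr
context. A standalone sketch under that assumption, with the same
illustrative map layout as above:

/* Sketch for the unconnected-UDP case, assuming the per-sendmsg
 * attach point (BPF_CGROUP_UDP4_SENDMSG) that later kernels
 * provide; the rewrite runs once per sendmsg() rather than per
 * packet in ip_finish_output(). Map name and layout are illustrative.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
	__u32 vip;	/* destination IP passed in msg_name (network order) */
	__u32 port;	/* destination port (network order) */
};

struct svc_backend {
	__u32 ip4;
	__u32 port;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

SEC("cgroup/sendmsg4")
int rewrite_sendmsg(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		/* Rewrite the datagram's destination once per sendmsg(),
		 * not once per packet on the output path.
		 */
		ctx->user_ip4  = be->ip4;
		ctx->user_port = be->port;
	}
	return 1;	/* 1 = let the sendmsg() proceed */
}

char _license[] SEC("license") = "GPL";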