On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the
>>> correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not
>> applicable."
>> To elaborate more on that:
>> netns is L2 isolation.
>> vrf is L3 isolation.
>> Whereas to containerize an application we need to punch
>> connectivity holes in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work and even
>> went as far as designing a new namespace for L4 isolation.
>> Unfortunately, hierarchical namespace abstractions don't work here.
>> To run an application inside a cgroup container that was not
>> written with containers in mind, we have to create the illusion
>> that it is running in a non-containerized environment.
>> In some cases we remember the port and container id in the
>> post-bind hook in a bpf map, and when some other task in a
>> different container tries to connect to a service, we need to know
>> where that service is running. It can be remote or local. Neither
>> the client nor the service needs to be written with containers in
>> mind; this sockaddr rewrite provides connectivity and load
>> balancing that you simply cannot get from hierarchical networking
>> primitives.
>
> I have to explain this a bit further...
> We also considered hacking these 'connectivity holes' into netns
> and/or vrf, but that would be a real layering violation, since a
> clean L2 or L3 abstraction would suddenly support something that
> breaks through the layers.
> Just as many consider ipvlan a bad hack for punching through the
> layers and connecting the L2 abstraction of netns at L3, this is
> not something the kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead, bpf programs at bind/connect time _help_ applications
> discover and connect to each other.
> All containers run in init_netns and there are no vrfs.
> After bind/connect, the normal fib/neighbor core networking logic
> works as it always does. The whole system is clean from a
> networking point of view.
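To make the mechanism described above concrete, here is a minimal
sketch of the two hooks, assuming the post-bind and connect attach
points this series adds (with their bpf_sock and bpf_sock_addr
contexts, and the return-1-to-allow convention). The map names,
key/value layouts, and the control-plane split are illustrative
assumptions, not taken from the patches.

/* A minimal sketch, not the actual patches: illustrates the
 * post-bind "remember where the service bound" step and the
 * connect-time sockaddr rewrite described above.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct svc_key {
	__u32 vip;	/* service IP the client asked for (network order) */
	__u32 port;	/* service port, network order as in user_port */
};

struct svc_backend {
	__u32 ip4;	/* address the service actually bound (network order) */
	__u32 port;	/* real port (network order) */
};

/* Populated from userspace by a control plane that joins the
 * post-bind records below with its container/service inventory.
 * (The "container id" from the cover letter lives in that
 * inventory; since cgroup-bpf programs attach per-cgroup, one map
 * per container cgroup is one way to keep the association.)
 */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

/* bound port (host order) -> bound address, written at bind time */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);
	__type(value, struct svc_backend);
} bound_ports SEC(".maps");

SEC("cgroup/post_bind4")
int remember_bind(struct bpf_sock *sk)
{
	__u32 port = sk->src_port;	/* host byte order in struct bpf_sock */
	struct svc_backend be = {
		.ip4  = sk->src_ip4,	/* already network byte order */
		.port = bpf_htons((__u16)sk->src_port),
	};

	bpf_map_update_elem(&bound_ports, &port, &be, BPF_ANY);
	return 1;	/* 1 = let the bind() complete */
}

SEC("cgroup/connect4")
int rewrite_connect(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		/* Transparent redirect: the app keeps its illusion of a
		 * flat network; the socket connects to the real backend.
		 */
		ctx->user_ip4  = be->ip4;
		ctx->user_port = be->port;
	}
	return 1;	/* 1 = let the connect() proceed */
}

char _license[] SEC("license") = "GPL";

Note that the rewrite costs one map lookup per bind()/connect()
call; after that the socket is an ordinary init_netns socket on the
normal fib/neighbor path.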
We apparently missed something when deploying ipvlan and one netns
per container/job.

Full access to 64K ports, no more ports being reserved/abused. If
one job needs more, no problem, just use more than one IP per netns.
It also works with UDP just fine.

Are you considering adding a hook later for sendmsg() (unconnected
socket or not), or do you want to use the existing one in
ip_finish_output(), adding per-packet overhead?

This notion of a 'clean l2, l3 abstraction' is very subjective. I
find netns isolation very clean and powerful, and it is already
there.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.
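For what it's worth on the sendmsg() question: the same
once-per-syscall rewrite can cover unconnected UDP without a
per-packet hook. Later kernels did add a cgroup attach point for
exactly this (BPF_CGROUP_UDP4_SENDMSG), reusing the bpf_sock_addr
context. A standalone sketch under that assumption, with the same
illustrative map layout as above:

/* Sketch for the unconnected-UDP case, assuming the per-sendmsg
 * attach point (BPF_CGROUP_UDP4_SENDMSG) that later kernels
 * provide; the rewrite runs once per sendmsg() rather than per
 * packet in ip_finish_output(). Map name and layout are illustrative.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
	__u32 vip;	/* destination IP passed in msg_name (network order) */
	__u32 port;	/* destination port (network order) */
};

struct svc_backend {
	__u32 ip4;
	__u32 port;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

SEC("cgroup/sendmsg4")
int rewrite_sendmsg(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		/* Rewrite the datagram's destination once per sendmsg(),
		 * not once per packet on the output path.
		 */
		ctx->user_ip4  = be->ip4;
		ctx->user_port = be->port;
	}
	return 1;	/* 1 = let the sendmsg() proceed */
}

char _license[] SEC("license") = "GPL";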