On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote: > To see how cgroup+bpf interacts with network namespaces, I wrote a > little program called show_bind that calls getsockopt(..., > SO_BINDTODEVICE, ...) and prints the result. It did this: > > # ./ip link add dev vrf0 type vrf table 10 > # ./ip vrf exec vrf0 ./show_bind > Default binding is "vrf0" > # ./ip vrf exec vrf0 unshare -n ./show_bind > show_bind: getsockopt: No such device > > What's happening here is that "ip vrf" looks up vrf0's ifindex in > the init netns and installs a hook that binds sockets to that > ifindex. When the hook runs in a different netns, it sets > sk_bound_dev_if to an ifindex from the wrong netns, resulting in > incorrect behavior. In this particular example, the ifindex was 4 > and there was no ifindex 4 in the new netns. If there had been, > this test would have malfunctioned differently > > Since it's rather late in the release cycle, let's punt. This patch > makes it impossible to install cgroup+bpf hooks outside the init > netns and makes them not run on sockets that aren't in the init > netns. > > In a future release, it should be relatively straightforward to make > these hooks be local to a netns and, if needed, to add a flag so > that hooks can be made global if necessary. Global hooks should > presumably be constrained so that they can't write to any ifindex > fields. > > Cc: Daniel Borkmann <dan...@iogearbox.net> > Cc: Alexei Starovoitov <a...@kernel.org> > Cc: David Ahern <d...@cumulusnetworks.com> > Signed-off-by: Andy Lutomirski <l...@kernel.org> > --- > > DaveM, this mitigates a bug in a feature that's new in 4.10, and the > bug can be hit using current iproute2 -git. please consider this for > -net. > > Changes from v1: > - Fix the commit message. 'git commit' was very clever and thought that > all the interesting bits of the test case were intended to be comments > and stripped them. Whoops! > > kernel/bpf/cgroup.c | 21 +++++++++++++++++++++ > kernel/bpf/syscall.c | 11 +++++++++++ > 2 files changed, 32 insertions(+) > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index a515f7b007c6..a824f543de69 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, > if (!sk || !sk_fullsock(sk)) > return 0; > > + /* > + * For now, socket bpf hooks attached to cgroups can only be > + * installed in the init netns and only affect the init netns. > + * This could be relaxed in the future once some semantic issues > + * are resolved. For example, ifindexes belonging to one netns > + * should probably not be visible to hooks installed by programs > + * running in a different netns. > + */ > + if (sock_net(sk) != &init_net) > + return 0;
Nack. Such check will create absolutely broken abi that we won't be able to fix later. > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index e89acea22ecf..c0bbc55e244d 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) > struct cgroup *cgrp; > enum bpf_prog_type ptype; > > + /* > + * For now, socket bpf hooks attached to cgroups can only be > + * installed in the init netns and only affect the init netns. > + * This could be relaxed in the future once some semantic issues > + * are resolved. For example, ifindexes belonging to one netns > + * should probably not be visible to hooks installed by programs > + * running in a different netns. the comment is bogus and shows complete lack of understanding how bpf is working and how it's being used. See SKF_AD_IFINDEX that was in classic bpf forever. > + */ > + if (current->nsproxy->net_ns != &init_net) > + return -EINVAL; this restriction I actually don't mind, since it indeed can be relaxed later, but please make it proper with net_eq()