From: Daniel Borkmann <dan...@iogearbox.net>
Date: Wed, 30 Jan 2019 12:49:48 +0100

> While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
> I ran into the issue that while l3 mode is working fine, l3s mode
> does not have any connectivity to kube-apiserver and hence all pods
> end up in Error state as well. The ipvlan master device sits on
> top of a bond device and hostns traffic to kube-apiserver (also running
> in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
> where the latter is the address of the bond0. While in l3 mode, a
> curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
> works fine from hostns, neither of them does in case of l3s. In the
> latter only a curl to https://127.0.0.1:37573 appeared to work where
> for local addresses of bond0 I saw kernel suddenly starting to emit
> ARP requests to query HW address of bond0 which remained unanswered
> and neighbor entries in INCOMPLETE state. These ARP requests only
> happen while in l3s.
>
> Debugging this further, I found the issue is that l3s mode is piggy-
> backing on l3 master device, and in this case local routes are using
> l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
> f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
> if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
> a loopback"). I found that reverting them back into using the
> net->loopback_dev fixed ipvlan l3s connectivity and got everything
> working for the CNI.
>
> Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
> l3mdev paper in [0] the sole reason why ipvlan l3s is relying
> on l3 master device is to get the l3mdev_ip_rcv() receive hook for
> setting the dst entry of the input route without adding its own
> ipvlan specific hacks into the receive path, however, any l3 domain
> semantics beyond just that are breaking l3s operation. Note that
> ipvlan also has the ability to dynamically switch its internal
> operation from l3 to l3s for all ports via ipvlan_set_port_mode()
> at runtime. In any case, l3 vs l3s solely distinguishes itself by
> 'de-confusing' netfilter through switching skb->dev to ipvlan slave
> device late in NF_INET_LOCAL_IN before handing the skb to L4.
>
> Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
> if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
> without any additional l3mdev semantics on top. This should also have
> minimal impact since dev->priv_flags is already hot in cache. With
> this set, l3s mode is working fine and I also get things like
> masquerading pod traffic on the ipvlan master properly working.
>
> [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
>
> Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
> if relevant")
> Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
> Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>

Applied and queued up for -stable, thanks.
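
For readers following the thread, the split the quoted message describes
can be sketched as a small user-space model. This is not the kernel code:
every model_* name, the struct, and the flag bit values are hypothetical
and chosen only for illustration, and L3 slave devices are left out. It
assumes, per the description above, that the receive hook should run for
a device with either flag, while only a full l3 master takes over local
input routes from the loopback device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; the kernel's priv_flags bits differ. */
enum {
	MODEL_IFF_L3MDEV_MASTER     = 1u << 0, /* full l3 domain semantics */
	MODEL_IFF_L3MDEV_RX_HANDLER = 1u << 1, /* receive hook only */
};

/* Hypothetical stand-in for a net device; not struct net_device. */
struct model_dev {
	const char *name;
	unsigned int priv_flags;
};

static bool model_is_l3_master(const struct model_dev *dev)
{
	return dev->priv_flags & MODEL_IFF_L3MDEV_MASTER;
}

static bool model_has_l3_rx_handler(const struct model_dev *dev)
{
	return dev->priv_flags & MODEL_IFF_L3MDEV_RX_HANDLER;
}

/* Receive path: the hook runs for a full master or an rx-handler-only dev. */
static bool model_rx_hook_runs(const struct model_dev *dev)
{
	assert(dev != NULL);
	return model_is_l3_master(dev) || model_has_l3_rx_handler(dev);
}

/*
 * Local input routes: only a full l3 master replaces the loopback device.
 * A device carrying just the rx-handler flag keeps using loopback, which
 * is the behaviour the message reports as working for l3s.
 */
static const struct model_dev *
model_local_route_dev(const struct model_dev *dev,
		      const struct model_dev *loopback)
{
	assert(dev != NULL && loopback != NULL);
	return model_is_l3_master(dev) ? dev : loopback;
}
```

In this model, a bond0 flagged as a full master gets both the hook and
itself as the local route device (the broken l3s case), whereas a bond0
carrying only the rx-handler flag keeps the hook but falls back to
loopback for local routes.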