Hi Steffen, On Tue, 6 Feb 2018 09:53:38 +0100 Steffen Klassert <steffen.klass...@secunet.com> wrote:
> Cc Wei Wang > > On Sun, Feb 04, 2018 at 01:21:18PM +0200, Eyal Birger wrote: > > Hi, > > > > We've encountered a non released device reference upon device > > unregistration which seems to stem from xfrm policy code. > > > > The setup includes: > > - an underlay device (e.g. eth0) using IPv4 > > - an xfrm IPv6 over IPv4 tunnel routed via the underlay device > > - an ipip6 tunnel over the xfrm IPv6 tunnel > > > > When tearing down the underlay device, after traffic had passed via > > the ipip6 tunnel, log messages of the following form are observed: > > > > unregister_netdevice: waiting for eth0 to become free. Usage count > > = 2 > > Looks like this happened when the dst garbage collection code was > removed. I could not point to a commit that introduced it so I > did a bisection and this pointed to: > > commit 9514528d92d4cbe086499322370155ed69f5d06c > ipv6: call dst_dev_put() properly > > With this commit we leak the one refcount and some further commit > leaked the second one. > > > > > The below synthetic script reproduces this consistently on a fresh > > ubuntu vm running net-next v4.15-6066-ge9522a5: > > --------------------------------------------------------- > > #!/bin/bash > > > > ipsec_underlay_dst=192.168.6.1 > > ipsec_underlay_src=192.168.5.2 > > ipv6_pfx=1234 > > local_ipv6_addr="$ipv6_pfx::1" > > remote_ipv6_addr="$ipv6_pfx::2" > > > > # create dummy ipsec underlay > > ip l add dev dummy1 type dummy > > ip l set dev dummy1 up > > ip r add "$ipsec_underlay_dst/32" dev dummy1 > > ip -6 r add "$ipv6_pfx::/16" dev dummy1 > > > > ip a add dev dummy1 "$local_ipv6_addr/128" > > ip a add dev dummy1 "$ipsec_underlay_src/24" > > > > # add xfrm policy and state > > ip x p add src "$local_ipv6_addr/128" dst "$ipv6_pfx::/16" dir out > > tmpl src "$ipsec_underlay_src" dst "$ipsec_underlay_dst" proto esp > > reqid 1 mode tunnel ip x s add src "$ipsec_underlay_src" dst > > "$ipsec_underlay_dst" proto esp spi 0xcd440ce6 reqid 1 mode tunnel > > auth-trunc 'hmac(sha1)' 0x34a546d309031628962b814ef073aff1a638ad21 > > 96 enc 'cbc(aes)' 0xf31e14149c328297fe7925ad7448420e encap espinudp > > 4500 4500 0.0.0.0 > > > > # add 4o6 tunnel > > ip l add tnl46 type ip6tnl mode ipip6 local "$local_ipv6_addr" > > remote "$remote_ipv6_addr" ip l set dev tnl46 up > > ip r add 10.64.0.0/10 dev tnl46 > > > > # pass traffic so route is cached > > ping -w 1 -c 1 10.64.0.1 > > > > # remove dummy underlay > > ip l del dummy1 > > --------------------------------------------------------- > > > > Analysis: > > > > ip6_tunnel holds a dst_cache which caches its underlay dst objects. > > When devices are unregistered, non-xfrm dst objects are invlidated > > by their original creators (ipv4/ipv6/...) and thus are wiped from > > dst_cache. > > > > xfrm created routes otoh are not tracked by xfrm, and are not > > invalidated upon device unregistration, thus hold the device upon > > unregistration. > > > > The following rough sketch patch illustrates an approach overcoming > > this issue: > > --------------------------------------------------------- [snip] > > --------------------------------------------------------- > > > > This approach has the unfortunate side effects of adding a spin > > lock for the tracked list, as well as increasing struct xfrm_dst. > > Reintroducing garbage collection is probably not a so good idea. I > think the patch below should fix it a bit less intrusive. > > > Subject: [PATCH RFC] xfrm: Fix netdev refcount leak when flushing the > percpu dst cache. > > The dst garbage collection code is removed, so we need to call > dst_dev_put() on cached dst entries before we release them. > Otherwise we leak the refcount to the netdev. > > Signed-off-by: Steffen Klassert <steffen.klass...@secunet.com> > --- > net/xfrm/xfrm_policy.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c > index 7a23078132cf..7836b7601b49 100644 > --- a/net/xfrm/xfrm_policy.c > +++ b/net/xfrm/xfrm_policy.c > @@ -1715,8 +1715,10 @@ static int xfrm_expand_policies(const struct > flowi *fl, u16 family, static void xfrm_last_dst_update(struct > xfrm_dst *xdst, struct xfrm_dst *old) { > this_cpu_write(xfrm_last_dst, xdst); > - if (old) > + if (old) { > + dst_dev_put(&old->u.dst); > dst_release(&old->u.dst); > + } > } > > static void __xfrm_pcpu_work_fn(void) > @@ -1787,6 +1789,7 @@ void xfrm_policy_cache_flush(void) > old = per_cpu(xfrm_last_dst, cpu); > if (old && !xfrm_bundle_ok(old)) { > per_cpu(xfrm_last_dst, cpu) = NULL; > + dst_dev_put(&old->u.dst); > dst_release(&old->u.dst); > } > rcu_read_unlock(); I have tested this and indeed it prevents the leak. But... IIUC the xfrm_last_dst cache is a single instance that is updated every time a new bundle is created, whereas ip6_tunnel uses a different dst_cache for each tunnel. Invalidating the dst every time a new bundle is created effectively means that in a multiple tunnels scenario (multiple ip6_tunnels over multiple xfrm policies) there is only one active ip6_tunnel dst_cache at a time. In case multiple tunnels are used at the same times, I think this essentially renders the ip6_tunnel dst_cache useless. Eyal.