On Thu, 2017-10-19 at 08:40 -0700, Alexander Duyck wrote:
> On Thu, Oct 19, 2017 at 5:19 AM, Anders K. Pedersen | Cohaesio
> <a...@cohaesio.com> wrote:
> > Hi Alex,
> >
> > On Wed, 2017-10-18 at 16:37 -0700, Alexander Duyck wrote:
> > > When we last talked I had asked if you could do a git bisect to
> > > find the memory leak and you said you would look into it. The
> > > most useful way to solve this would be to do a git bisect between
> > > your current kernel and the 4.11 kernel to find the point at
> > > which this started. If we can do that then fixing this becomes
> > > much simpler as we just have to fix the patch that introduced the
> > > issue.
> >
> > We're also seeing a smaller memory leak (about 1 GB per day) than
> > the original one even with the "Fix memory leak related filter
> > programming status" fix applied. So far I've determined that the
> > leak is present on 4.13.7 and was introduced between 4.11 and 4.12,
> > so I'll do another round of bisection to identify the patch that
> > introduced this.
> >
> > Since the router must run for a couple of hours before I can be
> > sure whether a kernel is good or bad, and I can't reboot it during
> > working hours, it'll probably be about a week before I have a
> > result.
> >
> > --
> > Venlig hilsen / Best Regards
> >
> > Anders K. Pedersen
> > Senior Technical Manager
>
> Anders,
>
> I'll do some digging on my side to see if I can find any other memory
> leaks that might be floating around in the driver that could have
> been introduced during that time-frame.
>
> One thing you might try that would help with your testing would be to
> just disable the ATR functionality in i40e. You can do that with the
> ethtool command "ethtool --set-priv-flags <iface> flow-director-atr
> off". That should allow you to bisect this without needing to deal
> with the "programming status" patches since you won't be programming
> ATR filters which is what caused that leak.
>
> Thanks for looking into this.
>
> - Alex

Hi Alex,

I began bisecting, applying the known fix patches at each step where
they were applicable (i.e. without changing the flow-director-atr
flag). Some of the steps had a high rate of packet drops, which caused
problems for our network, so I couldn't leave them running for the
several hours needed to determine whether the leak was present. The
part of the bisection I got through had the same outcome as the last
bisection, which led to "i40e: Fix support for flow director
programming status".

After that I experimented a bit with the flow-director-atr flag. It
turns out that if I disable this flag on all the NICs, the memory leak
is gone, so I suspected that the smaller memory leak was also caused by
"i40e: Fix support for flow director programming status". I tried
reverting this patch on 4.13 (with a manual fixup for the trace point
that had been added later), but that brought back the packet drops, so
I couldn't let it run.

This morning I saw your "i40e: Add programming descriptors to
cleaned_count" patch, so I tried 4.13.9 with that patch and the
previous "i40e: Fix memory leak related filter programming status"
patch, without turning off the flow-director-atr flag. So far this
combination is running stably without any memory leaks.

Thanks for fixing this.

Regards,
Anders