On Fri, 2019-03-15 at 21:26 +0100, Heiner Kallweit wrote:
> On 15.03.2019 21:09, VDR User wrote:
> > > > > > Thanks for the additional info and for testing 4.20.15.
> > > > > > To rule out that the issue is caused by a regression in network or
> > > > > > some other subsystem: Can you take the r8169.c from 4.20.15 and test
> > > > > > it on top of 5.0?
> > > > > > Meanwhile I'll look at the changes in the driver between 4.20 and
> > > > > > 5.0.
> > > > >
> > > > > Sure, no problem! I'll copy the driver & recompile now actually.
> > > > > Hopefully there aren't a ton of changes to r8169.c to sift through and
> > > > > the cause isn't good at hiding itself!
> > > > >
> > > > I checked the driver changes new in 5.0 and there are very few
> > > > functional changes. You could try to revert the following:
> > > >
> > > > 5317d5c6d47e ("r8169: use napi_consume_skb where possible")
> > >
> > > Will do, and fwiw, while I haven't been able to do tons of testing
> > > today, I haven't been able to trigger the crash after replacing
> > > 5.0.0's r8169.c with 4.20.15's r8169.c this morning. I'll restore the
> > > file and revert the change you mentioned, and report back my findings.
> >
> > Heiner,
> >
> > After going back to vanilla kernel 5.0 and then reverting 5317d5c6d47e
> > ("r8169: use napi_consume_skb where possible"), I so far have not had
> > any crashes after transferring roughly 30GB back & forth. I'm not
> > completely confident yet the crash is resolve with that revert and
> > will continue to do further testing throughout the weekend as well.
> > What confidence level do you have that 5317d5c6d47e is the culprit at
> > this point?
> >
> Good, thanks for testing. I simply see no other change since 4.20 that
> could cause these symptoms.
> Using napi_consume_skb() at this place in r8169.c looks safe to me.
> Option 1 is that I miss something, option 2 is that there's an issue
> in the NAPI subsystem. However in the latter case I assume at least
> the Mellanox and/or Intel guys would have observed the same issue
> on their respective CI systems.
> Let me add Alexander, maybe he can provide a hint before we go and
> revert the change.

Do you have the crash log? I'd be curious what the issue is we are
seeing. I agree I can't see anything obvious, but it is possible that
we may be running into something we hadn't seen with the Intel and
Mellanox parts.

- Alex