Sorry for the delay.

We have two options for continuing to debug this:

1. We can try a traditional git bisect.  This involves testing a series
of kernel builds to narrow the issue down to the specific commit that
fixes it.  For each step, I build a new kernel and you test it until it
fails, or until you've tested long enough to be sure that build is
'good'.  How long the whole process takes depends on how long each
build needs to be tested.  It's critical that the 'good' or 'bad'
verdict at each step is correct.  With hard-to-reproduce problems like
this, that is tough: a build that hasn't failed for a long time isn't
necessarily 'good', it may just not have failed yet.  A single wrong
verdict makes the bisect end at the wrong commit, which doesn't help
with figuring out how to fix anything.

2. Intel has provided me with some undocumented commands that control
which MDD events the NIC triggers on.  I can provide those
instructions, and you can test with each MDD event bit set
individually until the problem reproduces.  Then we will know exactly
which MDD source triggered the event, which should help identify what
the driver did to cause it.  This approach has a much better chance of
finding the specific problem, but the downside is that you'll need to
run undocumented commands on your hardware.  I believe there is no risk
in doing that, since the info came from Intel, but I can't personally
verify it, as I don't currently have access to this specific NIC.
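
As a rough sketch of the test loop for option 2 (Python, illustration
only): the number of MDD event bits and the way a mask gets applied to
the NIC are placeholders here, not the actual Intel commands, so
`num_bits` and `reproduces_with_mask` are hypothetical names.

```python
from typing import Callable, Optional


def find_mdd_source(num_bits: int,
                    reproduces_with_mask: Callable[[int], bool]) -> Optional[int]:
    """Try each event bit on its own and return the first bit that reproduces.

    num_bits is an assumed width for the event mask.
    reproduces_with_mask stands in for the manual step: apply the mask
    with the vendor-provided commands, run the workload, and report
    whether the MDD event showed up.
    """
    for bit in range(num_bits):
        mask = 1 << bit  # exactly one event source enabled
        if reproduces_with_mask(mask):
            return bit
    return None  # nothing reproduced with any single bit


if __name__ == "__main__":
    # Simulated run: pretend only bit 5 triggers the problem.
    print(find_mdd_source(8, lambda mask: mask == (1 << 5)))
```

In practice each call to the callback would be a manual test run, and
each run has to be long enough for the problem to show up.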

If you're willing to try #2, I'll add the specific commands/instructions
and you can get started testing.  Otherwise, if you would prefer not to
run the undocumented commands, I can start a kernel bisect.
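
To illustrate why a wrong verdict matters in a bisect, here is a small
Python simulation.  It is a sketch under stated assumptions, not how git
implements bisect: the build names, the position of the fixing build
(`FIRST_FIXED`) and the falsely 'good' build are all made up.

```python
from typing import Callable, Sequence


def bisect_first_good(builds: Sequence[str],
                      is_good: Callable[[str], bool]) -> str:
    """Binary search for the first build judged 'good'.

    Assumes builds are ordered oldest to newest, builds[0] is known
    bad and builds[-1] is known good.
    """
    lo, hi = 0, len(builds) - 1  # lo: known bad, hi: known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]


# Hypothetical history: 16 builds, the real fix lands in build-11.
BUILDS = [f"build-{i:02d}" for i in range(16)]
FIRST_FIXED = 11


def reliable(build: str) -> bool:
    """Every verdict is correct."""
    return BUILDS.index(build) >= FIRST_FIXED


def one_false_good(build: str) -> bool:
    """build-07 is really bad, but did not fail during the test window."""
    return build == "build-07" or reliable(build)


if __name__ == "__main__":
    print(bisect_first_good(BUILDS, reliable))        # build-11
    print(bisect_first_good(BUILDS, one_false_good))  # build-07
```

With correct verdicts the search ends at the build carrying the fix.
With one build wrongly marked 'good', every later step is consistent
with that mistake, so the search confidently ends at an unrelated build.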

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1723127

Title:
  Intel i40e PF reset due to incorrect MDD detection (continues...)

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress

Bug description:
  This is a continuation from bug 1713553; a patch was added in that bug
  to attempt to fix this, and it may have helped reduce the issue but
  appears not to have fixed it, based on more reports.

  The issue is that the i40e driver, when TSO is enabled, sometimes sees
  the NIC firmware issue an "MDD event", where MDD is "Malicious Driver
  Detection".  This is only vaguely defined in the i40e spec, and there
  is no way to tell what the NIC actually saw that it didn't like.  So
  the driver can do nothing but print an error message and reset the PF
  (or VF).  Unfortunately, this resets the interface, which interrupts
  network traffic flow while the PF is resetting.

  See bug 1713553 for more details.

