Hi there. I can confirm this problem still exists in the newest kernels
and with the latest Intel drivers as of today:

Jan 19 16:05:19 osd9 kernel: [511271.581413] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
Jan 19 16:09:08 osd9 kernel: [511500.919380] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
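
To see how often each port resets, the messages above can be tallied per PCI address. This is a minimal sketch: the regular expression is derived only from the two log lines quoted here, and the function name `count_pf_resets` is hypothetical, not part of any existing tool.

```python
import re
from collections import Counter

# Matches the reset message in the form quoted above; the PCI address
# (domain:bus:device.function) is captured so resets can be grouped per port.
RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def count_pf_resets(lines):
    """Return a Counter mapping PCI address -> number of PF reset messages."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group("pci")] += 1
    return counts


# The two log lines from this report, unwrapped.
SAMPLE = [
    "Jan 19 16:05:19 osd9 kernel: [511271.581413] i40e 0000:02:00.1: "
    "TX driver issue detected, PF reset issued",
    "Jan 19 16:09:08 osd9 kernel: [511500.919380] i40e 0000:02:00.0: "
    "TX driver issue detected, PF reset issued",
]

if __name__ == "__main__":
    for pci, n in sorted(count_pf_resets(SAMPLE).items()):
        print(pci, n)
```

In practice the input would be the lines of a kernel log file rather than `SAMPLE`.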


driver: i40e-2.4.3 (and xenial / 4.13 shipped driver: 2.1.14-k)
kernel: 4.13.0-25-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 12:16:39 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux. Kernel booted with nopti noibrs noibpb 
(Meltdown / Spectre mitigations disabled). 

We can trigger the issue under high load (benchmarking a Ceph cluster
with fio: 4 clients, 8 threads, iodepth 256, 100% random write, 64K
block size).

We only hit this problem with a relatively large block size (64K).
With 4K blocks we do not hit it. We haven't tested large random reads
yet (that test is still to be done).
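
For anyone trying to reproduce this, the workload parameters above can be expressed as a fio command line. This is only a sketch under assumptions: the helper `fio_args` and the target path are hypothetical, mapping "8 threads" to `--numjobs=8` is a guess, and the I/O engine is not set because it is not stated above (an iodepth above 1 only takes effect with an asynchronous engine). The command would be run on each of the 4 clients.

```python
def fio_args(block_size, target, jobs=8, iodepth=256):
    """Build a fio argument list for a 100% random-write test.

    block_size: fio block size string, e.g. "64k" or "4k"
    target:     file or device to write to (hypothetical; not given above)
    """
    return [
        "fio",
        "--name=randwrite-" + block_size,
        "--rw=randwrite",               # 100% random write
        "--bs=" + block_size,           # 64k triggers the reset, 4k does not
        "--iodepth=" + str(iodepth),
        "--numjobs=" + str(jobs),       # assumption: "8 threads" = 8 jobs
        "--filename=" + target,
    ]


if __name__ == "__main__":
    # Print both variants: the failing 64K case and the 4K control case.
    for bs in ("64k", "4k"):
        print(" ".join(fio_args(bs, "/path/to/test/file")))
```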

When using an Open vSwitch port-channel with jumbo frames (as we do),
the port-channel does not come back online after the reset. Reloading
the driver (rmmod i40e / modprobe i40e) does the trick, though.

-- 
https://bugs.launchpad.net/bugs/1713553

Title:
  Intel i40e PF reset due to incorrect MDD detection

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Using an Intel i40e network device, under heavy traffic load with
  TSO enabled, the device will spontaneously reset itself and issue errors
  similar to the following:

  Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e 0000:05:00.1: TX driver issue detected, PF reset issued

  This causes a full reset of the PF, which interrupts traffic flow.

  This was partially fixed by Xenial commit
  12f8cc59d5886b86372f45290166deca57a60d7a; however, one additional
  upstream commit is required to fully fix the issue:

  commit 841493a3f64395b60554afbcaa17f4350f90e764
  Author: Alexander Duyck <alexander.h.du...@intel.com>
  Date:   Tue Sep 6 18:05:04 2016 -0700

      i40e: Limit TX descriptor count in cases where frag size is greater than 16K

  This fix was never backported into the Xenial 4.4 kernel series, but
  is already present in the Xenial HWE (and Zesty) 4.10 kernel.

  [Testcase]

  In this case, the issue occurs at a customer site using i40e-based
  Intel network cards with SR-IOV enabled. Under heavy load, the card will
  reset itself as described.

  [Regression Potential]

  As with any change to a network card driver, this may cause
  regressions with network I/O through i40e card(s).  However, this
  specific change only increases the likelihood that any specific large
  TSO tx will need to be linearized, which will avoid the PF reset.
  Linearizing a TSO tx that did not need to be linearized will not cause
  any failures; it may only decrease performance slightly.  Moreover, this
  patch should only cause linearization when required to avoid the MDD
  detection and PF reset.
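
  The arithmetic behind the commit title can be illustrated with a short sketch. This is a simplified illustration, not the driver's actual logic: it assumes a single TX descriptor can carry at most 16K - 1 bytes, and the function names are hypothetical. The point is only that a fragment larger than 16K needs more than one descriptor, so counting one descriptor per fragment under-estimates the total.

```python
# Assumption: maximum payload one TX data descriptor can carry (16K - 1 bytes).
MAX_DATA_PER_DESCRIPTOR = 16 * 1024 - 1


def descriptors_for_frag(frag_size, max_per_desc=MAX_DATA_PER_DESCRIPTOR):
    """Number of descriptors needed for one fragment (ceiling division)."""
    return -(-frag_size // max_per_desc)


def descriptors_for_packet(frag_sizes, max_per_desc=MAX_DATA_PER_DESCRIPTOR):
    """Total descriptors for a packet made of the given fragment sizes."""
    return sum(descriptors_for_frag(size, max_per_desc) for size in frag_sizes)


if __name__ == "__main__":
    small = [4096] * 4        # four 4K fragments: one descriptor each
    large = [32768, 32768]    # two 32K fragments: more than one each
    print(len(small), descriptors_for_packet(small))
    print(len(large), descriptors_for_packet(large))
```

  With small fragments the fragment count and the descriptor count agree; with fragments above 16K they diverge, which is the case the upstream commit addresses.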

  [Other Info]

  The previous bug for this issue is bug 1700834.
