I think they are two distinct problems, and hopefully we will get a
comment from NVIDIA/Mellanox, as the statements in bug 2020409
contradict the documentation [0] that the current Netplan implementation
is based on.

Martin may have more details, but I wanted to mention that one of our
suspected culprits is how Netplan lays out the udev rules for VF
activation [1]:
1) It takes a long time when many VFs are configured, contrary to the
expectation stated in the comment.
2) The process appears to be executed multiple times, which, combined with
how long it takes, may end up clashing with both the networking backend's
creation of the bond and the systemd unit rebinding the VFs.

Bug 2020409 also raises the question of whether there are any
bond/LAG-related system bring-up quirks for systems using only Scalable
Functions (SF) or a combination of SFs and VFs. I have yet to see any
documentation about that.

0: https://enterprise-support.nvidia.com/s/article/Configuring-VF-LAG-using-TC
1: https://github.com/canonical/netplan/blob/a7e4be03918c986020650743cb6cf0934696ef0c/src/sriov.c#L107-L112

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Triaged
Status in linux source package in Jammy:
  New
Status in netplan.io source package in Jammy:
  In Progress
Status in linux source package in Kinetic:
  Fix Committed
Status in netplan.io source package in Kinetic:
  Won't Fix

Bug description:
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation
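
  To spot this failure across many boots or hosts, the kernel log can be
  scanned for the driver message quoted above. The following is only a
  minimal sketch: the function name find_lag_failures and the regular
  expression are illustrative assumptions, and the pattern is derived solely
  from the log excerpt in this report, so other kernel versions may word the
  message differently.

```python
import re

# Pattern derived from the log excerpt in this bug report (an assumption:
# other kernel versions may format the message differently).
LAG_FAILURE = re.compile(
    r"mlx5_core\s+(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"mlx5_activate_lag:\d+:\(pid\s+\d+\):\s+Failed to activate VF LAG"
)


def find_lag_failures(log_text):
    """Return PCI addresses that logged a VF-LAG activation failure.

    Addresses are returned in order of first appearance, without duplicates.
    """
    seen = []
    for match in LAG_FAILURE.finditer(log_text):
        pci = match.group("pci")
        if pci not in seen:
            seen.append(pci)
    return seen
```

  The function takes the log as a single string, for example the captured
  output of dmesg, and returns an empty list when no failure was logged.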

  This is caused by rebinding the driver before the VF LAG is ready.

  A debugfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:

      $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
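
  As a rough illustration of what monitoring that file could look like, here
  is a minimal polling sketch. It is not what Netplan or the systemd unit
  does. The helper name wait_for_lag_active, the timeout and the interval are
  hypothetical, and the sketch assumes the file reads "active" once the LAG
  is up, which should be checked against the referenced commit.

```python
import time
from pathlib import Path


def wait_for_lag_active(state_path, timeout=30.0, interval=0.5):
    """Poll the LAG state file until it reads "active" or the timeout expires.

    Returns True if the state became active in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            state = Path(state_path).read_text().strip()
        except OSError:
            # The file may be missing: kernel without the knob, debugfs not
            # mounted, or the device has not finished probing yet.
            state = ""
        if state == "active":  # assumed value, see the note above
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

  A caller would pass the path shown above, for example
  wait_for_lag_active("/sys/kernel/debug/mlx5/0000:08:00.0/lag/state"), and
  only proceed to rebind the VFs when it returns True. Reading debugfs
  normally requires root.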

  The kernel feature is available in the upcoming Kinetic 5.19 kernel
  and we should probably backport it to the Jammy 5.15 kernel.

  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900
