** Description changed:

  [ Impact ]
  
- Due to limitations in how Netplan handles SR-IOV devices, features such as
- VF-LAG and Scalable Functions couldn't be used. Certain configuration steps
+ Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
+ feature found on Mellanox NICs couldn't be used. Certain configuration steps
  must happen in a very specific order and Netplan fails to perform the set up correctly.
- 
- This SRU addresses the following two problems:
- 
- 1) Fail to activate Mellanox VF-LAG -
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018
  
  Netplan must wait until the backend finishes adding interfaces to the Bond
  and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
  the driver.
  
  See also https://bugs.launchpad.net/netplan/+bug/2083008
  
  This problem is fixed by introducing a proper ordering in the configuration process
  and monitoring the driver state until it reports as ready (or times out).
  
- 2) Impossibility to set the embedded switch mode without Virtual
- Functions - https://bugs.launchpad.net/netplan/+bug/2020409
- 
- Netplan wouldn't allow setting the e-switch mode without having Virtual Functions
- defined in the YAML. Setting the e-switch mode should be allowed independently of
- the existence of Virtual Functions.
- This problem prevents the use of Scalable Functions without SR-IOV. 
- 
- 
  [ Test Plan ]
  
- To reproduce the problems addressed by this SRU one needs to
- have access to Mellanox network interfaces that support SR-IOV.
+ To reproduce the problem addressed by this SRU one needs to
+ have access to specialized hardware (SR-IOV-capable Mellanox NICs).
  
- In this particular case we'll need help from the bug reporters (https://bugs.launchpad.net/netplan/+bug/2083008)
- to install and test the new netplan.io version in production.
- 
- The fixes for the problem 1) described above were already verified and
+ The fix for the problem described above was already verified on Ubuntu 22.04 and
  solved the problem (more details https://bugs.launchpad.net/netplan/+bug/2083008).
  
- The fixes for the problem 2) were tested on real hardware when they were implemented
- (see https://github.com/canonical/netplan/pull/454 for details) but still need to be
- tested on Ubuntu 22.04.
+ We will work with Canonical's Openstack team to do the fix verification.
  
   * detailed instructions how to reproduce the bug
- 
- Problem 1)
  
  A configuration file that looks like the one below can be used
  to test the fix.
  
  After booting the system with this configuration, the Mellanox driver
- should report the LAG state as "active".
+ should report the LAG state as "active" for all the devices.
  It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state
  
  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
  
      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
  
    bonds:
      bond0:
        interfaces:
          - ens4f0np0
          - ens4f1np1
        parameters:
          mode: active-backup
  
- Problem 2)
- 
- A configuration like the below can be used to test if the e-switch mode
- can be set to "switchdev" without Virtual Functions:
- 
- network:
-   version: 2
-   ethernets:
-     enp3s0f0np0:
-       match:
-         macaddress: 98:03:9b:c3:ef:ba
-       mtu: 9000
-       set-name: enp3s0f0np0
-       embedded-switch-mode: switchdev
-     enp3s0f1np1:
-       match:
-         macaddress: 98:03:9b:c3:ef:bb
-       mtu: 9000
-       set-name: enp3s0f1np1
-       embedded-switch-mode: switchdev
- 
- After applying the configuration, the e-switch mode can be checked with
- the devlink tool. For example:
- 
- root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.0
- pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
- root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.1
- pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic
- 
  [ Where problems could occur ]
  
  These changes should affect only SR-IOV related scenarios.
  Undetected problems could cause Netplan to fail to configure the device
  and Virtual Functions wouldn't be created anymore.
  
  [ Other Info ]
  
  Related work:
  
  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
- https://bugs.launchpad.net/netplan/+bug/2020409
  https://github.com/canonical/netplan/pull/439
- https://github.com/canonical/netplan/pull/454
+ 
+ A PPA for Ubuntu 22.04 can be found here
+ https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru
  
  
  ---- Original bug description ----
  
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.
  
  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation
  
  This is caused by rebinding the driver prior to the VF lag being ready.
  
  A sysfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:
  
      $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
  
  The kernel feature is available in the upcoming Kinetic 5.19 kernel and
  we should probably backport it to the Jammy 5.15 kernel.
  
  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

** Description changed:

  [ Impact ]
  
  Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
  feature found on Mellanox NICs couldn't be used. Certain configuration steps
  must happen in a very specific order and Netplan fails to perform the set up correctly.
  
  Netplan must wait until the backend finishes adding interfaces to the Bond
  and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
  the driver.
  
  See also https://bugs.launchpad.net/netplan/+bug/2083008
  
  This problem is fixed by introducing a proper ordering in the configuration process
  and monitoring the driver state until it reports as ready (or times out).
  
+ This fix is available on Ubuntu 24.04.
+ 
  [ Test Plan ]
  
  To reproduce the problem addressed by this SRU one needs to
  have access to specialized hardware (SR-IOV-capable Mellanox NICs).
  
  The fix for the problem described above was already verified on Ubuntu 22.04 and
  solved the problem (more details https://bugs.launchpad.net/netplan/+bug/2083008).
  
  We will work with Canonical's Openstack team to do the fix verification.
  
-  * detailed instructions how to reproduce the bug
+  * detailed instructions how to reproduce the bug
  
  A configuration file that looks like the one below can be used
  to test the fix.
  
  After booting the system with this configuration, the Mellanox driver
  should report the LAG state as "active" for all the devices.
  It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state
  
  network:
-   version: 2
-   ethernets:
-     ens4f0np0:
-       virtual-function-count: 16
-       embedded-switch-mode: switchdev
-       delay-virtual-functions-rebind: true
+   version: 2
+   ethernets:
+     ens4f0np0:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
  
-     ens4f1np1:
-       virtual-function-count: 16
-       embedded-switch-mode: switchdev
-       delay-virtual-functions-rebind: true
+     ens4f1np1:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
  
-   bonds:
-     bond0:
-       interfaces:
-         - ens4f0np0
-         - ens4f1np1
-       parameters:
-         mode: active-backup
+   bonds:
+     bond0:
+       interfaces:
+         - ens4f0np0
+         - ens4f1np1
+       parameters:
+         mode: active-backup
  
  [ Where problems could occur ]
  
  These changes should affect only SR-IOV related scenarios.
  Undetected problems could cause Netplan to fail to configure the device
  and Virtual Functions wouldn't be created anymore.
  
  [ Other Info ]
  
  Related work:
  
  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
  https://github.com/canonical/netplan/pull/439
  
  A PPA for Ubuntu 22.04 can be found here
  https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru
- 
  
  ---- Original bug description ----
  
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.
  
  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation
  
  This is caused by rebinding the driver prior to the VF lag being ready.
  
  A sysfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:
  
      $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
  
  The kernel feature is available in the upcoming Kinetic 5.19 kernel and
  we should probably backport it to the Jammy 5.15 kernel.
  
  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Confirmed
Status in netplan.io source package in Jammy:
  In Progress
Status in linux source package in Kinetic:
  Won't Fix
Status in netplan.io source package in Kinetic:
  Won't Fix
Status in linux source package in Mantic:
  Won't Fix
Status in netplan.io source package in Mantic:
  Won't Fix
Status in linux source package in Noble:
  Fix Committed
Status in netplan.io source package in Noble:
  Fix Released

Bug description:
  [ Impact ]

  Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
  feature found on Mellanox NICs couldn't be used. Certain configuration steps
  must happen in a very specific order, and Netplan fails to perform the setup
  correctly.

  Netplan must wait until the backend finishes adding interfaces to the Bond
  and the Mellanox driver reports the VF-LAG feature as "active" before binding
  VFs to the driver.

  See also https://bugs.launchpad.net/netplan/+bug/2083008

  This problem is fixed by introducing a proper ordering in the configuration
  process and monitoring the driver state until it reports as ready (or times out).
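
  The "monitor until ready or timed out" step can be pictured with the
  minimal Python sketch below. It is only an illustration, not Netplan's
  actual code: the function names, the default timeout and the polling
  interval are assumptions made for this example. Only the debugfs path
  and the "active" value come from this bug report.

```python
import time
from typing import Callable


def wait_for_lag_active(read_state: Callable[[], str],
                        timeout: float = 30.0,
                        interval: float = 0.5) -> bool:
    """Poll read_state() until it returns "active" or the timeout expires.

    Returns True if the state became "active" in time, False otherwise.
    Any other value is treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        if read_state().strip() == "active":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def read_lag_state(pci_addr: str) -> str:
    """Read the mlx5 LAG state from debugfs (needs root and a mounted debugfs)."""
    path = f"/sys/kernel/debug/mlx5/{pci_addr}/lag/state"
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # Missing or unreadable file: report an empty state so the caller
        # keeps waiting instead of crashing.
        return ""
```

  A caller would bind the VFs only after something like
  wait_for_lag_active(lambda: read_lag_state("0000:08:00.0")) returns
  True, where the PCI address is a placeholder for the real device.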

  This fix is available on Ubuntu 24.04.

  [ Test Plan ]

  To reproduce the problem addressed by this SRU one needs to
  have access to specialized hardware (SR-IOV-capable Mellanox NICs).

  The fix for the problem described above was already verified on Ubuntu 22.04
  and solved the problem (for more details, see
  https://bugs.launchpad.net/netplan/+bug/2083008).

  We will work with Canonical's Openstack team to do the fix
  verification.

   * detailed instructions on how to reproduce the bug

  A configuration file that looks like the one below can be used
  to test the fix.

  After booting the system with this configuration, the Mellanox driver
  should report the LAG state as "active" for all the devices.
  It can be checked in the debugfs file:
  /sys/kernel/debug/mlx5/{pci_addr}/lag/state

  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

    bonds:
      bond0:
        interfaces:
          - ens4f0np0
          - ens4f1np1
        parameters:
          mode: active-backup
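
  To check all devices in one go after boot, a small helper along the
  following lines could be used. This is a sketch under assumptions: the
  function name is made up for this example, and the PCI addresses passed
  in must be replaced by those of the NIC ports under test. Reading
  debugfs requires root.

```python
from pathlib import Path


def inactive_lag_devices(pci_addrs, debugfs_root="/sys/kernel/debug/mlx5"):
    """Return the PCI addresses whose lag/state file does not read "active"."""
    failed = []
    for addr in pci_addrs:
        state_file = Path(debugfs_root) / addr / "lag" / "state"
        try:
            state = state_file.read_text().strip()
        except OSError:
            # A missing or unreadable file counts as "not active".
            state = ""
        if state != "active":
            failed.append(addr)
    return failed
```

  An empty list means every listed device reports "active", which is the
  expected result with the fixed netplan.io package.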

  [ Where problems could occur ]

  These changes should affect only SR-IOV-related scenarios.
  Undetected problems could cause Netplan to fail to configure the device,
  in which case Virtual Functions would no longer be created.

  [ Other Info ]

  Related work:

  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
  https://github.com/canonical/netplan/pull/439

  A PPA for Ubuntu 22.04 can be found here
  https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru

  ---- Original bug description ----

  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation
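
  Because the failure is intermittent, it can help to scan saved kernel
  logs for it across reboots. The sketch below is an assumption-laden
  example, not part of any existing tool: the function name is invented,
  and the pattern is derived only from the message format shown above,
  which may differ between kernel versions.

```python
import re

# Matches the "Failed to activate VF LAG" line shown above and captures
# the PCI address of the reporting device.
VF_LAG_FAILURE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+): "
    r"mlx5_activate_lag:\d+:\(pid\s+\d+\): Failed to activate VF LAG"
)


def vf_lag_failures(log_text):
    """Return the sorted PCI addresses that logged a VF-LAG activation failure."""
    return sorted({m.group("pci") for m in VF_LAG_FAILURE.finditer(log_text)})
```

  The input can be the text of a captured dmesg output, read from a file
  or from a subprocess.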

  This is caused by rebinding the driver prior to the VF lag being
  ready.

  A debugfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:

      $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel
  and we should probably backport it to the Jammy 5.15 kernel.

  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018/+subscriptions


