** Description changed:

+ [ Impact ]
+
+ Due to limitations in how Netplan handles SR-IOV devices, features such as
+ VF-LAG and Scalable Functions couldn't be used. Certain configuration steps
+ must happen in a very specific order and Netplan fails to perform the set up correctly.
+
+ This SRU addresses the following two problems:
+
+ 1) Fail to activate Mellanox VF-LAG -
+ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018
+
+ Netplan must wait until the backend finishes adding interfaces to the Bond
+ and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
+ the driver.
+
+ See also https://bugs.launchpad.net/netplan/+bug/2083008
+
+ This problem is fixed by introducing a proper ordering in the configuration process
+ and monitoring the driver state until it reports as ready (or times out).
+
+ 2) Impossibility to set the embedded switch mode without Virtual
+ Functions - https://bugs.launchpad.net/netplan/+bug/2020409
+
+ Netplan wouldn't allow setting the e-switch mode without having Virtual Functions
+ defined in the YAML. Setting the e-switch mode should be allowed independently of
+ the existence of Virtual Functions.
+ This problem prevents the use of Scalable Functions without SR-IOV.
+
+
+ [ Test Plan ]
+
+ To reproduce the problems addressed by this SRU one needs to
+ have access to Mellanox network interfaces that support SR-IOV.
+
+ In this particular case we'll need help from the bug reporters (https://bugs.launchpad.net/netplan/+bug/2083008)
+ to install and test the new netplan.io version in production.
+
+ The fixes for the problem 1) described above were already verified and
+ solved the problem (more details https://bugs.launchpad.net/netplan/+bug/2083008).
+
+ The fixes for the problem 2) were tested on real hardware when they were implemented
+ (see https://github.com/canonical/netplan/pull/454 for details) but still need to be
+ tested on Ubuntu 22.04.
+
+ * detailed instructions how to reproduce the bug
+
+ Problem 1)
+
+ A configuration file that looks like the one below can be used
+ to test the fix.
+
+ After booting the system with this configuration, the Mellanox driver
+ should report the LAG state as "active".
+ It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state
+
+ network:
+   version: 2
+   ethernets:
+     ens4f0np0:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+
+     ens4f1np1:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+
+   bonds:
+     bond0:
+       interfaces:
+         - ens4f0np0
+         - ens4f1np1
+       parameters:
+         mode: active-backup
+
+ Problem 2)
+
+ A configuration like the below can be used to test if the e-switch mode
+ can be set to "switchdev" without Virtual Functions:
+
+ network:
+   version: 2
+   ethernets:
+     enp3s0f0np0:
+       match:
+         macaddress: 98:03:9b:c3:ef:ba
+       mtu: 9000
+       set-name: enp3s0f0np0
+       embedded-switch-mode: switchdev
+     enp3s0f1np1:
+       match:
+         macaddress: 98:03:9b:c3:ef:bb
+       mtu: 9000
+       set-name: enp3s0f1np1
+       embedded-switch-mode: switchdev
+
+ After applying the configuration, the e-switch mode can be checked with
+ the devlink tool. For example:
+
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.0
+ pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.1
+ pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic
+
+ [ Where problems could occur ]
+
+ These changes should affect only SR-IOV related scenarios.
+ Undetected problems could cause Netplan to fail to configure the device
+ and Virtual Functions wouldn't be created anymore.
+
+ [ Other Info ]
+
+ Related work:
+
+ https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
+ https://bugs.launchpad.net/netplan/+bug/2020409
+ https://github.com/canonical/netplan/pull/439
+ https://github.com/canonical/netplan/pull/454
+
+
+ ---- Original bug description ----
+
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:

  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
- Make sure all VFs are unbound prior to VF LAG activation or deactivation
+ Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver prior to the VF lag being ready.

  A sysfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:

- $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
+ $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel and
  we should probably backport it to the Jammy 5.15 kernel.

  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

** Summary changed:

- [mlx5] Intermittent VF-LAG activation failure
+ [SRU][mlx5] Intermittent VF-LAG activation failure

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  New
Status in netplan.io source package in Jammy:
  In Progress
Status in linux source package in Kinetic:
  Won't Fix
Status in netplan.io source package in Kinetic:
  Won't Fix
Status in linux source package in Mantic:
  Won't Fix
Status in netplan.io source package in Mantic:
  Won't Fix
Status in linux source package in Noble:
  Fix Committed
Status in netplan.io source package in Noble:
  Fix Released

Bug description:
  [ Impact ]

  Due to limitations in how Netplan handles SR-IOV devices, features such as
  VF-LAG and Scalable Functions couldn't be used. Certain configuration steps
  must happen in a very specific order and Netplan fails to perform the setup
  correctly.

  This SRU addresses the following two problems:

  1) Failure to activate Mellanox VF-LAG -
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018

  Netplan must wait until the backend finishes adding interfaces to the Bond
  and the Mellanox driver reports the VF-LAG feature as "active" before
  binding VFs to the driver.

  See also https://bugs.launchpad.net/netplan/+bug/2083008

  This problem is fixed by introducing a proper ordering in the configuration
  process and monitoring the driver state until it reports as ready (or
  times out).

  2) Impossibility to set the embedded switch mode without Virtual
  Functions - https://bugs.launchpad.net/netplan/+bug/2020409

  Netplan wouldn't allow setting the e-switch mode without having Virtual
  Functions defined in the YAML. Setting the e-switch mode should be allowed
  independently of the existence of Virtual Functions.
  This problem prevents the use of Scalable Functions without SR-IOV.

  [ Test Plan ]

  To reproduce the problems addressed by this SRU one needs to
  have access to Mellanox network interfaces that support SR-IOV.

  In this particular case we'll need help from the bug reporters
  (https://bugs.launchpad.net/netplan/+bug/2083008)
  to install and test the new netplan.io version in production.

  The fixes for problem 1) described above were already verified and
  solved the problem (more details in
  https://bugs.launchpad.net/netplan/+bug/2083008).

  The fixes for problem 2) were tested on real hardware when they were
  implemented (see https://github.com/canonical/netplan/pull/454 for details)
  but still need to be tested on Ubuntu 22.04.

  * detailed instructions on how to reproduce the bug

  Problem 1)

  A configuration file that looks like the one below can be used
  to test the fix.

  After booting the system with this configuration, the Mellanox driver
  should report the LAG state as "active".
  It can be checked in the debugfs file:
  /sys/kernel/debug/mlx5/{pci_addr}/lag/state

  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

    bonds:
      bond0:
        interfaces:
          - ens4f0np0
          - ens4f1np1
        parameters:
          mode: active-backup

  Problem 2)

  A configuration like the one below can be used to test whether the
  e-switch mode can be set to "switchdev" without Virtual Functions:

  network:
    version: 2
    ethernets:
      enp3s0f0np0:
        match:
          macaddress: 98:03:9b:c3:ef:ba
        mtu: 9000
        set-name: enp3s0f0np0
        embedded-switch-mode: switchdev
      enp3s0f1np1:
        match:
          macaddress: 98:03:9b:c3:ef:bb
        mtu: 9000
        set-name: enp3s0f1np1
        embedded-switch-mode: switchdev

  After applying the configuration, the e-switch mode can be checked with
  the devlink tool. For example:

  root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.0
  pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
  root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.1
  pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic

  [ Where problems could occur ]

  These changes should affect only SR-IOV related scenarios.
  Undetected problems could cause Netplan to fail to configure the device
  and Virtual Functions wouldn't be created anymore.

  [ Other Info ]

  Related work:

  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
  https://bugs.launchpad.net/netplan/+bug/2020409
  https://github.com/canonical/netplan/pull/439
  https://github.com/canonical/netplan/pull/454

  ---- Original bug description ----

  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:

  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
  Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver prior to the VF lag being ready.

  A sysfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:

  $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel and
  we should probably backport it to the Jammy 5.15 kernel.

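The "monitor the state file until it reports ready (or times out)" step described above can be sketched roughly as follows. This is only an illustration and not Netplan's actual implementation: the function names, the 30 second timeout and the 0.5 second polling interval are all assumptions. The only details taken from this report are the debugfs path layout and the "active" state string.

```python
import time
from pathlib import Path


def read_lag_state(pci_addr: str, debugfs_root: str = "/sys/kernel/debug") -> str:
    """Return the content of mlx5/<pci_addr>/lag/state under debugfs_root."""
    path = Path(debugfs_root) / "mlx5" / pci_addr / "lag" / "state"
    return path.read_text().strip()


def wait_for_lag_active(
    pci_addr: str,
    timeout: float = 30.0,    # assumed value, not taken from Netplan
    interval: float = 0.5,    # assumed value, not taken from Netplan
    debugfs_root: str = "/sys/kernel/debug",
) -> bool:
    """Poll the LAG state file until it reads "active" or the timeout expires.

    Returns True if the state became "active", False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if read_lag_state(pci_addr, debugfs_root) == "active":
                return True
        except OSError:
            # The file may not exist yet (driver not loaded, debugfs not
            # mounted, or a kernel without this debugfs entry); keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would rebind the VFs only once wait_for_lag_active("0000:08:00.0") returns True, and log a warning or give up if it returns False. Reading debugfs normally requires root.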
  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp