After a reboot with the configuration from the test case on Mellanox CX6, the system comes up in a half-broken state (not all VFs created, eswitch mode "legacy", LAG disabled):

ubuntu@romano:~$ sudo lshw -c network -businfo
Bus info          Device         Class    Description
============================================================
pci@0000:21:00.0  ens13f0np0     network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:21:00.1  ens13f1np1     network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:61:00.0  ens7f0         network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.1  ens7f1         network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.2  ens7f0v0       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.3  ens7f0v1       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.4  ens7f0v2       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.5  ens7f0v3       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.6  ens7f0v4       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.7  ens7f0v5       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.0  ens7f0v6       network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.1  ens7f0v7       network  ConnectX Family mlx5Gen Virtual Function

ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.0
pci/0000:61:00.0: mode legacy inline-mode none encap-mode basic
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.1
pci/0000:61:00.1: mode legacy inline-mode none encap-mode basic

ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.0/lag/state
disabled
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.1/lag/state
disabled

Interestingly, running "netplan apply" after the reboot seems to move the system into the desired state:

ubuntu@romano:~$ sudo netplan apply
[] Cannot find unique matching interface for ens7f0
[] Cannot find unique matching interface for ens7f1

ubuntu@romano:~$ sudo lshw -c network -businfo
Bus info          Device         Class    Description
============================================================
pci@0000:21:00.0  ens13f0np0     network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:21:00.1  ens13f1np1     network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:61:00.0  ens7f0         network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.1  ens7f1         network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.2                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.3                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.4                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.5                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.6                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.7                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.0                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.1                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.2                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.3                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.4                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.5                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.6                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.7                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:02.0                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:02.1                 network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.0  ens7f0npf0vf0  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf1  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf2  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf3  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf4  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf5  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf6  network  Ethernet interface
pci@0000:61:00.0  ens7f0npf0vf7  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf0  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf1  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf2  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf3  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf4  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf5  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf6  network  Ethernet interface
pci@0000:61:00.1  ens7f1npf1vf7  network  Ethernet interface

ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.1
pci/0000:61:00.1: mode switchdev inline-mode none encap-mode basic
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.0
pci/0000:61:00.0: mode switchdev inline-mode none encap-mode basic

ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.1/lag/state
active
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.0/lag/state
active

This isn't quite as it should be, but might be an easy workaround for Mellanox CX6.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Confirmed
Status in netplan.io source package in Jammy:
  Fix Committed
Status in linux source package in Kinetic:
  Won't Fix
Status in netplan.io source package in Kinetic:
  Won't Fix
Status in linux source package in Mantic:
  Won't Fix
Status in netplan.io source package in Mantic:
  Won't Fix
Status in linux source package in Noble:
  Fix Committed
Status in netplan.io source package in Noble:
  Fix Released

Bug description:
  [ Impact ]

  Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG feature found on Mellanox NICs couldn't be used. Certain configuration steps must happen in a very specific order and Netplan fails to perform the setup correctly. Netplan must wait until the backend finishes adding interfaces to the Bond and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to the driver.
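
  The wait described above could be sketched roughly as follows. This is a minimal Python illustration and not Netplan's actual implementation: the function names, the timeout and the polling interval are hypothetical. Only the debugfs path /sys/kernel/debug/mlx5/{pci_addr}/lag/state and the "active" value are taken from this report. Reading debugfs normally requires root.

```python
import time
from pathlib import Path

# debugfs location quoted in this bug report; the rest is illustrative.
DEBUGFS_ROOT = Path("/sys/kernel/debug/mlx5")


def read_lag_state(pci_addr, root=DEBUGFS_ROOT):
    """Return the LAG state string for one PF, or None if it cannot be read."""
    try:
        return (Path(root) / pci_addr / "lag" / "state").read_text().strip()
    except OSError:
        # File missing (older kernel, debugfs not mounted) or no permission.
        return None


def wait_for_lag_active(pci_addrs, timeout=30.0, interval=0.5, root=DEBUGFS_ROOT):
    """Poll until every PF reports "active"; return False if the timeout expires.

    The timeout and interval defaults are arbitrary assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        if all(read_lag_state(addr, root) == "active" for addr in pci_addrs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # PCI addresses of the two PFs from the session shown in this report.
    ready = wait_for_lag_active(["0000:61:00.0", "0000:61:00.1"])
    print("VF-LAG active, safe to rebind VFs" if ready else "timed out waiting for VF-LAG")
```

  A caller would rebind the VFs only when the function returns True, and report an error (or fall back) on timeout.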

  See also https://bugs.launchpad.net/netplan/+bug/2083008

  This problem is fixed by introducing a proper ordering in the configuration process and monitoring the driver state until it reports as ready (or times out).

  This fix is available on Ubuntu 24.04.

  [ Test Plan ]

  To reproduce the problem addressed by this SRU one needs access to specialized hardware (SR-IOV-capable Mellanox NICs).

  The fix for the problem described above was already verified on Ubuntu 22.04 and solved the problem (more details: https://bugs.launchpad.net/netplan/+bug/2083008).

  We will work with Canonical's OpenStack team to do the fix verification.

  * detailed instructions how to reproduce the bug

  A configuration file that looks like the one below can be used to test the fix. After booting the system with this configuration, the Mellanox driver should report the LAG state as "active" for all the devices. It can be checked in the debugfs file:

  /sys/kernel/debug/mlx5/{pci_addr}/lag/state

  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
    bonds:
      bond0:
        interfaces:
          - ens4f0np0
          - ens4f1np1
        parameters:
          mode: active-backup

  [ Where problems could occur ]

  These changes should affect only SR-IOV related scenarios. Undetected problems could cause Netplan to fail to configure the device, in which case Virtual Functions would no longer be created.

  [ Other Info ]

  Related work:
  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
  https://github.com/canonical/netplan/pull/439

  A PPA for Ubuntu 22.04 can be found here:
  https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru

  ---- Original bug description ----

  During system initialization there is a specific sequence that must be followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:

  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver before the VF LAG is ready.

  A debugfs knob has recently been added to the driver [0] and we should monitor it before attempting to rebind the driver:

  $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel and we should probably backport it to the Jammy 5.15 kernel.

  [0]: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp