I think they are two distinct problems, and hopefully we will get a comment from NVIDIA/Mellanox, as the statements in bug 2020409 contradict the documentation [0] that the current Netplan implementation is based on.
Martin may have more details, but I wanted to mention that one of our suspected culprits is how Netplan lays out the udev rules for VF activation [1]:

1) It takes a long time when many VFs are configured, contrary to the expectation stated in the comment there.
2) The process appears to be executed multiple times, which, combined with the long run time, may end up clashing with both the networking backend's creation of the bond and the systemd unit rebinding the VFs.

Bug 2020409 also raises the question of whether there are any bond/LAG-related system bringup quirks for systems using only Scalable Functions (SFs) or a combination of SFs and VFs. I have yet to see any documentation about that.

0: https://enterprise-support.nvidia.com/s/article/Configuring-VF-LAG-using-TC
1: https://github.com/canonical/netplan/blob/a7e4be03918c986020650743cb6cf0934696ef0c/src/sriov.c#L107-L112

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Triaged
Status in linux source package in Jammy:
  New
Status in netplan.io source package in Jammy:
  In Progress
Status in linux source package in Kinetic:
  Fix Committed
Status in netplan.io source package in Kinetic:
  Won't Fix

Bug description:
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.
  Intermittently one may see that VF-LAG initialization fails:

  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
  Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver before the VF LAG is ready.

  A debugfs knob has recently been added to the driver [0], and we
  should monitor it before attempting to rebind the driver:

  $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel,
  and we should probably backport it to the Jammy 5.15 kernel.

  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900
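To illustrate the "monitor it before attempting to rebind" idea from the bug description, here is a minimal sketch of a polling helper. It is not what Netplan or the systemd unit actually does. The function name wait_for_lag_active and the timeout handling are hypothetical, and it assumes the state file reads "active" once the LAG is ready, which should be checked against the kernel in use.

```shell
#!/bin/sh
# Hypothetical helper, not part of Netplan or the mlx5 driver.
# Polls a LAG state file once per second until it reads "active"
# or the timeout (in seconds, default 30) expires.
# Assumption: the debugfs file contains the single word "active"
# when VF LAG is ready; verify this on the target kernel.
wait_for_lag_active() {
    state_file=$1
    timeout=${2:-30}
    elapsed=0
    while :; do
        # The file may not exist yet early in boot, so test readability first.
        if [ -r "$state_file" ] && [ "$(cat "$state_file")" = "active" ]; then
            return 0
        fi
        # Give up once the timeout has been reached.
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example call, using the PCI address from the log above:
#   wait_for_lag_active /sys/kernel/debug/mlx5/0000:08:00.0/lag/state 30 \
#       && echo "VF LAG active, safe to rebind VFs"
```

A bounded wait seems preferable to an open-ended one here, so that a system where LAG never becomes active still finishes booting and the failure is visible rather than silent.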