Well, yes - I just ran into this while trying to proceed on this ticket.
Things definitely do not apply on focal master-next
(git clone git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal --branch master-next --single-branch).
Even the first commit ad11c4f1d8fd causes a significant conflict, since the file
"drivers/net/ethernet/mellanox/mlx5/core/lag/mp.c" does not exist (anymore?).
So I unfortunately cannot apply these commits as they are:

ad11c4f1d8fd ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5 "net/mlx5e: Lag, Only handle events from highest priority multipath entry"
27b0420fd959 27b0420fd959e38e3500e60b637d39dfab065645 "net/mlx5e: Lag, Fix use-after-free in fib event handler"
a6589155ec98 a6589155ec9847918e00e7279b8aa6d4c272bea7 "net/mlx5e: Lag, Fix fib_info pointer assignment"
4a2a664ed879 4a2a664ed87962c4ddb806a84b5c9634820bcf55 "net/mlx5e: Lag, Don't skip fib events on current dst"

I may try to have a look at what's going on, but would like to ask you to also double-check if there is anything missing ...

-- 
https://bugs.launchpad.net/bugs/1990275

Title:
  [UBUNTU 20.04] Unexpected LAG affinity behaviour with mlx5_core driver in Ubuntu 20.04

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  New

Bug description:
  == Comment: #0 - KISHORE KUMAR G <kishor...@in.ibm.com> - 2022-09-19 04:39:42 ==

  ---Problem Description---
  On an Ubuntu/s390x system that houses a Mellanox CX5 adapter with two ports connected to a
  pair of TOR switches, acting as the entry point (edge node) for a cluster of compute nodes
  to reach the public network, with the following mlx firmware level:

  ethtool -i p0
  driver: mlx5e_rep
  version: 5.4.0-104.118-
  firmware-version: 16.27.1016 (MT_0000000013)
  expansion-rom-version:
  bus-info: 0100:00:00.0
  supports-statistics: yes
  supports-test: no
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: no

  The LAG affinity module of mlx5_core in the upstream 5.4 kernel listens to routing events
  and sets the LAG affinity accordingly; in addition, one of the custom services (Fabcon)
  also listens to routing events and sets the LAG affinity in the Mellanox driver.
  The edge node routes defined on the compute nodes use both interfaces (port 1 = p0 and
  port 2 = p1) for the LAG affinity. For instance:

  10.66.0.170 proto bgp src 10.66.11.43 metric 20
        nexthop via 172.31.22.42 dev p0 weight 1
        nexthop via 172.31.22.170 dev p1 weight 1

  For example, after an edge node bootup, the LAG mapping converges to use both port 1 (p0)
  and port 2 (p1) by default:

  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2   (<------ both ports are equally mapped)

  The issue comes when the mlx5_core driver cannot derive the LAG configuration from specific
  routes. For instance, disabling an interface of the edge node route above (10.66.0.170), or
  adding/removing an interface, causes the mlx5_core driver to react to the routing change and
  switch the LAG affinity to a single network interface only, roughly as sketched below.
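  For illustration, the interface-down variant would look roughly like this (a sketch only,
  not taken from an actual run; the interface names p0/p1 and the dmesg pattern are the ones
  used elsewhere in this report):

  # take one LAG member port down and watch the driver remap the LAG
  ip link set dev p1 down
  # show only the most recent remap event
  dmesg | grep "modify lag map" | tail -n 1
  # per the behaviour described above, both LAG ports would then map to the
  # remaining port, e.g. "modify lag map port 1:1 port 2:1"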
  In the following example, a new static route entry to a single destination (10.66.47.34)
  is added as below:

  ip route add 10.66.47.34 proto static src 10.66.11.43 metric 20 via 172.31.22.42 dev p0

  This caused the LAG mapping to change to port 1 (p0), as detected in the following:

  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
  [ 757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1   <---- mapping directs all traffic through p0

  The above behaviour causes all the traffic in 10.x to use a single network interface. The
  TOR switches (fabric) do not capture or know about such a LAG affinity change, and therefore
  packets are dropped on the "not in use" interface (e.g. port 2 (p1)). So the Mellanox driver
  (mlx5_core) should not change the LAG mapping/config based on the last route event, but
  should rather rely on the default routes only.

  Mellanox agreed to patch this, and the fixes are available in kernels 5.15.29 and 5.15.39
  respectively. The following are the commits that resolve this issue:

  1. net/mlx5e: Lag, Only handle events from highest priority multipath entry (5.15.29)
     https://github.com/torvalds/linux/commit/ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5
  2. net/mlx5e: Lag, Don't skip fib events on current dst (5.15.29)
     https://github.com/torvalds/linux/commit/4a2a664ed87962c4ddb806a84b5c9634820bcf55
  3. net/mlx5e: Lag, Fix fib_info pointer assignment (5.15.39)
     https://github.com/torvalds/linux/commit/a6589155ec9847918e00e7279b8aa6d4c272bea7
  4. net/mlx5e: Lag, Fix use-after-free in fib event handler (5.15.39)
     https://github.com/torvalds/linux/commit/27b0420fd959e38e3500e60b637d39dfab065645

  The request is to have the above commits backported to the Ubuntu 20.04.x series, including
  the Ubuntu 18.04 HWE kernel.

  Contact Information = Kishore Kumar G/kishore.pil...@in.ibm.com utsav.shrivas...@ibm.com

  ---Additional Hardware Info---
  Mellanox CX5 adapter with firmware-version: 16.27.1016 (MT_0000000013)

  ---uname output---
  Linux version: 5.4.0-104.118
  Machine Type = s390x LPAR

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
  ...
  "default proto bgp src 10.66.11.41 metric 20
        nexthop via 172.31.22.40 dev p0 weight 1
        nexthop via 172.31.22.168 dev p1 weight 1"
  ...
  172.31.22.40/31 dev p0 proto kernel scope link src 172.31.22.41
  172.31.22.168/31 dev p1 proto kernel scope link src 172.31.22.169
  ...

  Also, we have around 64 SR-IOV devices for VM consumption.

  In the above case, the LAG mapping works as expected, as shown below, using both ports
  (p0 and p1) for traffic:

  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2   <<<--- expected behaviour

  The issue comes when we set an additional route to a single IP in the underlying network
  with a single next hop: we observe that all traffic is shifted to a single next-hop port,
  as the example below shows.
  root@pok1-qz1-sr1-rk011-s20:/# ip route add 10.66.47.34 proto static src 10.66.11.41 metric 20 via 172.31.22.40 dev p0
  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
  [ 757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1   <<<<------- issue

  Stack trace output: no
  Oops output: no
  System Dump Info: The system is not configured to capture a system dump.

  *Additional Instructions for Kishore Kumar G/kishore.pil...@in.ibm.com utsav.shrivas...@ibm.com:
  -Attach sysctl -a output to the bug.
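A rough way to double-check why the upstream commits above do not apply to the focal
master-next tree (a sketch only, assuming the clone from the git command at the top of this
mail and network access to fetch the upstream patches; in a 5.4-based tree the multipath LAG
code presumably still lives in lag_mp.c rather than under lag/, so the paths would need
adjusting):

# locate where the multipath LAG code lives in the focal tree
git ls-files 'drivers/net/ethernet/mellanox/mlx5/core/*lag*'

# fetch one of the upstream fixes and test-apply it without committing;
# a failure here confirms that the paths/context need a manual backport
wget -q https://github.com/torvalds/linux/commit/ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5.patch
git apply --check ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5.patch || echo "needs a manual backport"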