Public bug reported:

[Impact]

In BF5.15 (Jammy), CX7 cards experience consistent CQ errors with syndrome 0x1 when running a performance script:

mlx5_core 0000:08:00.0: cq_err_event_notifier:538:(pid 9712): CQ error on CQN 0x424, syndrome 0x1

Multiple call traces appear in dmesg and the system becomes unresponsive. The test may require multiple iterations to trigger the issue.

The root cause appears to be a missing upstream fix. Without it, crashes or warnings can occur when a netlink policy is not found, which may cause the observed CQ errors during high-connection testing.

[Fix]

Cherry-pick the upstream commit:

154ba79c9f16 ("genetlink: correctly begin the iteration over policies")

This commit fixes incorrect initialization in genl_op_iter_init() by ensuring genl_op_iter_next() is called to properly begin the iteration. The fix prevents crashes and warnings in netlink_policy_dump_get_policy_idx() when a policy is not found, which may be contributing to the CQ error condition during intensive connection testing.

[Test Case]

Compile tested on linux-bluefield-5.15 on the master-next branch.

Functional testing: run the test for multiple iterations on CX7 hardware with a linux-bluefield-5.15 kernel that includes the fix. With the patch applied, the test should complete without CQ errors and the system should remain responsive.

[Regression Potential]

The change is minimal and matches the upstream implementation exactly.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2117349

Title:
  Ubuntu 22.04: CQ errors causing system unresponsiveness
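
[Illustration]

The iterator problem described under [Fix] can be illustrated with a small user-space model. This is only a sketch, not kernel code: the names toy_op, toy_iter, toy_iter_init(), toy_iter_next() and toy_count_policies() are hypothetical. It also assumes that the init step only reports whether there is anything to iterate over, and that a separate next step loads the first entry. Under that assumption, a caller that reads the current entry straight after init would be looking at an entry that was never loaded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-operation entry; not a kernel type. */
struct toy_op {
    int cmd;
    const char *policy;         /* NULL models "no policy for this op" */
};

/* Hypothetical iterator state; only loosely modelled on the description. */
struct toy_iter {
    const struct toy_op *ops;
    size_t count;
    size_t pos;                 /* index of the next entry to load */
    const struct toy_op *cur;   /* valid only after toy_iter_next() returns true */
};

/* Reports whether there is anything to iterate over; does NOT load an entry. */
static bool toy_iter_init(struct toy_iter *it, const struct toy_op *ops,
                          size_t count)
{
    it->ops = ops;
    it->count = count;
    it->pos = 0;
    it->cur = NULL;
    return count > 0;
}

/* Loads the next entry into it->cur; returns false once exhausted. */
static bool toy_iter_next(struct toy_iter *it)
{
    if (it->pos >= it->count) {
        it->cur = NULL;
        return false;
    }
    it->cur = &it->ops[it->pos++];
    return true;
}

/*
 * Safe pattern: always call toy_iter_next() before touching it.cur.
 * Returns the number of entries that carry a policy.
 */
static size_t toy_count_policies(const struct toy_op *ops, size_t count)
{
    struct toy_iter it;
    size_t with_policy = 0;

    toy_iter_init(&it, ops, count);
    while (toy_iter_next(&it)) {
        if (it.cur->policy != NULL)
            with_policy++;
    }
    return with_policy;
}
```

In this model, calling toy_iter_next() on an empty set simply returns false, so the loop needs no separate emptiness check. Whether the kernel iterator behaves identically should be confirmed against the upstream commit itself.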