Public bug reported:

[Impact]
On BF5.15 (Jammy), CX7 cards consistently report CQ errors with syndrome 0x1
when running a performance script:
mlx5_core 0000:08:00.0: cq_err_event_notifier:538:(pid 9712): CQ error on CQN 0x424, syndrome 0x1

Multiple call traces appear in dmesg and the system becomes unresponsive. The
test may need several iterations to trigger the issue.
The root cause appears to be a missing upstream fix: without it, the kernel can
crash or emit warnings when a netlink policy is not found, which may lead to
the observed CQ errors during high-connection testing.

[Fix]
Cherry-pick the upstream commit:
154ba79c9f16 ("genetlink: correctly begin the iteration over policies")

This commit fixes how the iteration set up by genl_op_iter_init() is
started, by ensuring genl_op_iter_next() is called to properly begin it.
The fix prevents crashes and warnings in
netlink_policy_dump_get_policy_idx() when a policy is not found, which
may be contributing to the CQ error condition during intensive
connection testing.
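
To illustrate the class of bug described above, here is a minimal toy model
in Python. It is not kernel code: ToyOpIter, toy_iter_init(), toy_iter_next()
and the two start_dump_*() helpers are hypothetical names invented for this
sketch. It assumes an init/next style iterator where init only reports
whether any entries exist, and the first entry is loaded only by the first
call to next.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyOpIter:
    """Toy stand-in for an init/next style iterator (hypothetical)."""
    ops: List[str] = field(default_factory=list)
    index: int = 0
    current: Optional[str] = None  # entry loaded by the last successful next()


def toy_iter_init(it: ToyOpIter, ops: List[str]) -> bool:
    """Reset the iterator. Returns True if there is anything to iterate,
    but does NOT load the first entry."""
    it.ops = list(ops)
    it.index = 0
    it.current = None
    return bool(it.ops)


def toy_iter_next(it: ToyOpIter) -> bool:
    """Load the next entry into it.current. Returns False when exhausted."""
    if it.index >= len(it.ops):
        return False
    it.current = it.ops[it.index]
    it.index += 1
    return True


def start_dump_buggy(ops: List[str]) -> Tuple[bool, Optional[str]]:
    """Treats the return value of init as 'first entry is loaded'."""
    it = ToyOpIter()
    has_entry = toy_iter_init(it, ops)
    return has_entry, it.current  # current is still None here


def start_dump_fixed(ops: List[str]) -> Tuple[bool, Optional[str]]:
    """Calls next once after init so the first entry is actually loaded."""
    it = ToyOpIter()
    toy_iter_init(it, ops)
    has_entry = toy_iter_next(it)
    return has_entry, it.current
```

In this model, start_dump_buggy() claims an entry is available while none has
been loaded, so any later lookup keyed on the current entry would fail.
start_dump_fixed() is also safe on an empty list, because the first call to
next simply returns False.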

[Test Case]
Compile tested on linux-bluefield-5.15 on the master-next branch.
Functional testing involves running the test for multiple iterations on CX7
hardware with a linux-bluefield-5.15 kernel that includes the fix. With the
patch applied, the test should complete without CQ errors and the system
should remain responsive.
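
One way to check the pass condition after each iteration is to scan the
kernel log for the CQ error message quoted in [Impact]. The following Python
sketch is only an assumption about how such a check could look; the
find_cq_errors() helper and the regular expression are hypothetical and are
modelled on the single log line shown in this report.

```python
import re
from typing import List, Tuple

# Pattern modelled on the log line in this report; other kernels or driver
# versions may format the message differently.
CQ_ERROR_RE = re.compile(
    r"mlx5_core\s+(?P<dev>\S+):\s+cq_err_event_notifier:\d+:\(pid \d+\):\s+"
    r"CQ error on CQN\s+(?P<cqn>0x[0-9a-fA-F]+),\s+"
    r"syndrome\s+(?P<syndrome>0x[0-9a-fA-F]+)"
)


def find_cq_errors(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (device, CQN, syndrome) for every CQ error found in log_text."""
    return [
        (m.group("dev"), m.group("cqn"), m.group("syndrome"))
        for m in CQ_ERROR_RE.finditer(log_text)
    ]


SAMPLE_LOG = (
    "mlx5_core 0000:08:00.0: cq_err_event_notifier:538:(pid 9712): "
    "CQ error on CQN 0x424, syndrome 0x1\n"
)
```

Feeding the captured dmesg output to find_cq_errors() after each iteration
and requiring an empty list would match the expectation stated above.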

[Regression Potential]
The change is minimal and matches the upstream implementation exactly.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2117349

Title:
  Ubuntu 22.04: CQ errors causing system unresponsiveness

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-bluefield/+bug/2117349/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
