Public bug reported:

[Impact]

In Ubuntu-bluefield-5.15.0-1071.73, which includes commits from upstream stable version 5.15.183, the system crashes after building the kernel, building the OFED driver, and restarting the driver:

  Oops: 0000 [#1] SMP NOPTI
  Workqueue: events kfree_rcu_work
  RIP: 0010:kmem_cache_free_bulk+0x137/0x1d0
  Call Trace:
   kfree_rcu_work+0x1e7/0x250
   process_one_work+0x1b0/0x350
   worker_thread+0x50/0x3a0
   kthread+0x124/0x150
   ret_from_fork+0x1f/0x30

The crash is caused by using the k[v]free_rcu_mightsleep() functions, which were introduced by the faulty commit 5dc583481a0a ("rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()"). This commit introduces the new mightsleep functions but lacks critical infrastructure changes required for proper operation.

Our analysis indicates that the root cause is an incomplete API migration, which causes the mightsleep macros to pass void pointers where rcu_callback_t function pointers are expected:

  BF5.15 (broken): void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
  Required:        void kvfree_call_rcu(struct rcu_head *head, void *ptr)

This results in invalid pointer arithmetic that generates tiny memory addresses (like 0x17), which crash the kernel when they are freed.

[Fix]

Phase 1 (Immediate): Revert the problematic commit to restore stability, along with the two other commits from the same series:

* 57818f6fec6c ("can: gw: fix RCU/BH usage in cgw_create_job()")
* 5dc583481a0a ("rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()") (main problematic commit)
* 82683fabcb28 ("can: gw: use call_rcu() instead of costly synchronize_rcu()")

Phase 2 (Proper Implementation): The results of our research should be verified and applied to Jammy to enable proper *_mightsleep() support for the OFED driver.
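The pointer arithmetic described under [Impact] can be illustrated with a small userspace sketch. This is not kernel code: struct rcu_head_sim, struct demo_obj and every helper name below are hypothetical, and the "old" convention shown here (the second argument is treated as an offset and subtracted from the head address) is an assumption made for illustration, not a statement about the exact BF5.15 code path. Under that assumption, a caller that passes the object pointer to a callee expecting an offset computes head - object, which is the small offset of the head inside the object rather than a valid address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct rcu_head (illustration only). */
struct rcu_head_sim {
	struct rcu_head_sim *next;
	void (*func)(struct rcu_head_sim *head);
};

/* Hypothetical object that embeds the head after some payload. */
struct demo_obj {
	char payload[24];
	struct rcu_head_sim rh;
};

/*
 * Callee, old convention (assumed): the second argument carries the byte
 * offset of the head inside the object, so object = head - offset.
 */
static uintptr_t object_addr_old_callee(uintptr_t head, uintptr_t second_arg)
{
	return head - second_arg;
}

/* Callee, new convention: the second argument is the object address itself. */
static uintptr_t object_addr_new_callee(uintptr_t head, uintptr_t second_arg)
{
	(void)head;
	return second_arg;
}

/* Matched pair: an offset is passed to the offset-based callee. */
static uintptr_t matched_old(const struct demo_obj *obj)
{
	assert(obj != NULL);
	return object_addr_old_callee((uintptr_t)&obj->rh,
				      offsetof(struct demo_obj, rh));
}

/* Matched pair: the object address is passed to the pointer-based callee. */
static uintptr_t matched_new(const struct demo_obj *obj)
{
	assert(obj != NULL);
	return object_addr_new_callee((uintptr_t)&obj->rh, (uintptr_t)obj);
}

/*
 * Mismatch: the caller follows the new convention (passes the object
 * address) but the callee still subtracts its second argument from the
 * head address. The result is head - object, i.e. only the offset.
 */
static uintptr_t mismatched(const struct demo_obj *obj)
{
	assert(obj != NULL);
	return object_addr_old_callee((uintptr_t)&obj->rh, (uintptr_t)obj);
}
```

In this sketch, both matched pairs recover the address of the object, while mismatched() returns offsetof(struct demo_obj, rh), a value far below any valid heap address. A bogus pointer of that kind, handed to the deferred free path, would be consistent with the tiny addresses seen in the crash.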
The most critical commit to verify and apply is the upstream commit that introduces the kvfree_call_rcu() signature change:

* 04a522b7da3d ("rcu: Refactor kvfree_call_rcu() and high-level helpers")

Additionally, the following commits should be examined to determine whether they are essential for avoiding future issues:

* 7e3f926bf453 ("rcu/kvfree: Eliminate k[v]free_rcu() single argument macro")
* 5da7cb193db3 ("rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period")
* 23532061ad30 ("net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()")

A deeper investigation should also be conducted to ensure that no additional crucial commits are required to integrate this feature properly into Jammy. Once all necessary commits are backported, the *_mightsleep() functions can be safely reintroduced into Jammy.

[Test Case]

Phase 1: After reverting the three commits listed above, compilation completed successfully on the master-next branch. After compiling the kernel, rebooting, building OFED, and restarting the driver, no crash occurred.

Phase 2: After applying all required infrastructure commits and re-adding the mightsleep functions, the system should remain stable when building OFED and restarting the driver.

[Regression Potential]

Phase 1 (Revert): Very low risk. The revert simply removes the problematic new functionality and returns to the stable state that existed before the faulty commit.

Phase 2 (Proper implementation): Medium risk, as it requires backporting multiple upstream RCU infrastructure changes to an older kernel base.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2117163

Title:
  Jammy: Bluefield5.15 Kernel Crashes with kfree_rcu_mightsleep Functions

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-bluefield/+bug/2117163/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs