Public bug reported:

[Impact]
In Ubuntu-bluefield-5.15.0-1071.73, which includes commits from upstream stable 
version 5.15.183, the system crashes after building the kernel, building the 
OFED driver, and restarting the driver:

Oops: 0000 [#1] SMP NOPTI
Workqueue: events kfree_rcu_work
RIP: 0010:kmem_cache_free_bulk+0x137/0x1d0
Call Trace:
 kfree_rcu_work+0x1e7/0x250
 process_one_work+0x1b0/0x350
 worker_thread+0x50/0x3a0
 kthread+0x124/0x150
 ret_from_fork+0x1f/0x30

The crash is caused by the k[v]free_rcu_mightsleep() functions, which were 
introduced by the faulty commit 5dc583481a0a ("rcu/kvfree: Add 
kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()").
This commit introduces the new mightsleep functions but lacks the 
infrastructure changes they depend on.
Our analysis indicates the root cause is an incomplete API migration: the 
mightsleep macros pass void pointers where rcu_callback_t function pointers 
are still expected:
BF5.15 (broken): void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
Required:        void kvfree_call_rcu(struct rcu_head *head, void *ptr)
This results in invalid pointer arithmetic that produces tiny memory addresses 
(such as 0x17), which crash the kernel when they are freed.

[Fix]
Phase 1 (Immediate):
Revert the problematic commit to restore stability, along with the two other 
commits from the same series:
* 57818f6fec6c ("can: gw: fix RCU/BH usage in cgw_create_job()")
* 5dc583481a0a ("rcu/kvfree: Add kvfree_rcu_mightsleep() and 
kfree_rcu_mightsleep()") (main problematic commit)
* 82683fabcb28 ("can: gw: use call_rcu() instead of costly synchronize_rcu()")

Phase 2 (Proper Implementation):
The results of our research should be verified and applied to Jammy to enable 
proper *_mightsleep() support for the OFED driver.
The most critical commit to verify and apply is the upstream commit introducing 
the kvfree_call_rcu() signature transformation:
* 04a522b7da3d ("rcu: Refactor kvfree_call_rcu() and high-level helpers")
Additionally, the following commits should be examined to determine whether 
they are essential for avoiding future issues:
* 7e3f926bf453 ("rcu/kvfree: Eliminate k[v]free_rcu() single argument macro")
* 5da7cb193db3 ("rcu/kvfree: Avoid freeing new kfree_rcu() memory after old 
grace period")
* 23532061ad30 ("net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()")
A deeper investigation should also be conducted to ensure no additional crucial 
commits are required for proper integration of this feature into Jammy.
Once all necessary commits are backported, the *_mightsleep() functions can be 
safely re-introduced into Jammy.

[Test Case]
Phase 1:
After reverting the three commits mentioned above, the compilation completed 
successfully on the master-next branch.
After reverting, compiling the kernel, rebooting, building OFED and restarting 
the driver, no crash occurred.

Phase 2:
After applying all required infrastructure commits and re-adding the mightsleep 
functions, the system should remain stable when building OFED and restarting 
the driver.

[Regression Potential]
Phase 1 (Revert): 
Very low risk. Simply removes the problematic new functionality and returns to 
the stable state that existed before the faulty commit.

Phase 2 (Proper implementation):
Medium risk as it requires backporting multiple upstream RCU infrastructure 
changes to an older kernel base.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2117163

Title:
  Jammy: Bluefield5.15 Kernel Crashes with kfree_rcu_mightsleep
  Functions
