On 11/3/25 04:17, Harry Yoo wrote:
> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>>
>>
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> > addition to main and spare sheaves.
>> >
>> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> > the sheaf is detached and submitted to call_rcu() with a handler that
>> > will try to put it in the barn, or flush to slab pages using bulk free,
>> > when the barn is full. Then a new empty sheaf must be obtained to put
>> > more objects there.
>> >
>> > It's possible that no free sheaves are available to use for a new
>> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> > kfree_rcu() implementation.
>> >
>> > Expected advantages:
>> > - batching the kfree_rcu() operations, that could eventually replace the
>> >   existing batching
>> > - sheaves can be reused for allocations via barn instead of being
>> >   flushed to slabs, which is more efficient
>> >   - this includes cases where only some cpus are allowed to process rcu
>> >     callbacks (Android)
>> >
>> > Possible disadvantage:
>> > - objects might be waiting for more than their grace period (it is
>> >   determined by the last object freed into the sheaf), increasing memory
>> >   usage - but the existing batching does that too.
>> >
>> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> > implementation favors smaller memory footprint over performance.
>> >
>> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> > contexts where kfree_rcu() is called might not be compatible with taking
>> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> > spinlock - the current kfree_rcu() implementation avoids doing that.
>> >
>> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> > that have them. This is not a cheap operation, but the barrier usage is
>> > rare - currently kmem_cache_destroy() or on module unload.
>> >
>> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> > many had to fall back to the existing implementation.
>> >
>> > Signed-off-by: Vlastimil Babka <[email protected]>
>>
>> Hi Vlastimil,
>>
>> This patch increases kmod selftest (stress module loader) runtime by about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>> causing this, or how to address it?
>
> This is likely due to increased kvfree_rcu_barrier() during module unload.

Hm so there are actually two possible sources of this. One is that the
module creates some kmem_cache and calls kmem_cache_destroy() on it before
unloading. That does kvfree_rcu_barrier(), which iterates all caches via
flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
that single cache.

The other source is codetag_unload_module(), and I'm afraid it's this one,
as it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING
enabled? Disabling it should help in this case, if you don't need memory
allocation profiling for that stress test. I think there's some room for
improvement - when it's compiled in but memory allocation profiling was
never enabled during the uptime, this could probably be skipped? Suren?

> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> there should be only a few now) pair to make sure rcu sheaf is flushed
> by the time kvfree_rcu_barrier() returns.

Yeah, also it's done under slab_mutex. Is the stress test trying to unload
multiple modules in parallel? That would make things worse, although I'd
expect there's a lot of serialization in this area already.

Unfortunately it will get worse with sheaves extended to all caches. We
could probably mark caches once they allocate their first rcu_free sheaf
(should not add visible overhead) and keep skipping those that never did.

> Just being curious, do you have any serious workload that depends on
> the performance of module unload?
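
To make the two ideas above a bit more concrete (a barrier limited to a
single cache, and skipping caches that never allocated an rcu_free sheaf),
here is a rough userspace model. It is only a sketch: struct model_cache,
NR_MODEL_CPUS and the model_*() functions are hypothetical names invented for
illustration, not the kernel's struct kmem_cache or any existing slab
function, and it ignores locking, memory ordering and the actual RCU grace
period handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_MODEL_CPUS 4	/* hypothetical CPU count for this model */

/* Hypothetical stand-in for a slab cache with sheaves. */
struct model_cache {
	bool used_rcu_sheaf;		/* set on first rcu_free sheaf use, never cleared */
	int pending[NR_MODEL_CPUS];	/* objects in each CPU's rcu_free sheaf */
};

/* Models kfree_rcu() placing one object on a CPU's rcu_free sheaf. */
static void model_kfree_rcu(struct model_cache *c, int cpu)
{
	assert(c && cpu >= 0 && cpu < NR_MODEL_CPUS);
	c->used_rcu_sheaf = true;	/* mark the cache on first use */
	c->pending[cpu]++;
}

/*
 * Weaker barrier: flush only one cache.
 * Returns the number of per-CPU sheaves that had to be visited.
 */
static size_t model_flush_one(struct model_cache *c)
{
	size_t visited = 0;

	if (!c->used_rcu_sheaf)
		return 0;		/* never had an rcu_free sheaf, nothing to do */

	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		c->pending[cpu] = 0;
		visited++;
	}
	return visited;
}

/* Full barrier: walk all caches, skipping the ones that were never marked. */
static size_t model_flush_all(struct model_cache *caches, size_t nr)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr; i++)
		visited += model_flush_one(&caches[i]);
	return visited;
}
```

In this model, with three caches of which only one ever saw a kfree_rcu(),
model_flush_all() visits NR_MODEL_CPUS sheaves instead of 3 * NR_MODEL_CPUS.
A real implementation would additionally have to make the mark visible to
the flushing side before the first object lands on the sheaf.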

