force_merge with decoupled clear tracker

Arunpravin Paneer Selvam Wed, 10 Jun 2026 06:06:23 -0700



On 6/10/2026 2:49 PM, Matthew Auld wrote:

On 01/06/2026 11:51, Arunpravin Paneer Selvam wrote:
On 5/29/2026 11:11 PM, Matthew Auld wrote:
Hi,

On 27/05/2026 12:29, Arunpravin Paneer Selvam wrote:
The current buddy allocator maintains separate clear_tree[] and
dirty_tree[] rbtrees per order, preventing coalescing between cleared
and dirty buddies. Under mixed workloads, this creates a mergebarrier:adjacent buddies frequently end up split across trees, forcingreliance
on __force_merge() during allocation.
__force_merge() performs an O(N x max_order) scan under the VRAMmanager
lock, leading to allocation stalls and failures for large contiguous
requests even when sufficient total free memory is available.
So is this contig with non power-of-two sizes?
Both power-of-two and non-power-of-two contiguous requests areaffected - in either case, the required higher-order block can't formwhen its lower-order buddies are separated by clear/dirty stateacross the dual trees. But the core issue we are seeing is VRAMfragmentation caused by massive small allocations (e.g., thousands of4 KiB–8 KiB buffers) that end up split across clear and dirty trees,preventing buddy coalescing. This leads to allocation failures andOOM in later workloads even when sufficient total free VRAM isavailable.
Do we know if we could force_merge everything in one go or somehowbe more aggressive and do more than needed now, at the first sign ofcontention here, instead of doing it piecemeal? Downside would belosing more of the clear tracking, when this happens, but more re-merging.
Could we have another per-order list, of all blocks that we failedto merge, when we did the free step? When doing the force mergestep, we maybe don't need to search blindly and can focus instead onthe stuff tracked in those lists? Maybe it doesn't need to be alist, but could be another rb-tree?
We know the size of the total allocation, if we trigger force_merge,could we try to merge enough in one go for the entire allocation,instead of restarting the entire thing on the next iteration? Wouldthat help at all?
But I guess these are more for the stalling side, and won't helpmuch with the contig angle?
The memory is highly fragmented into mostly 4 KiB chunks and smallscattered blocks across the dual trees, so although total free memoryexists, it is split into low-order fragments. The workload thenrequests very large contiguous allocations (tens of GBs, e.g., ~64GiB), which fail with OOM because the allocator cannot formsufficiently large high- order blocks from the fragmented space. Wecould go with more aggressive merging or merge-in-one-go approaches,but this might waste more cleared memory. I think fundamentally thebuddy allocator should be allowed to merge unconditionally - thesingle-tree approach with unconditional coalescing would improve thefragmentation and benefit contiguous allocations along withaddressing the stalling and latency issues.
For the extent idea, is there any merit in maybe doing this for allcontig blobs, and not just cleared stuff? Or is the workload you areseeing only benefit users that want cleared stuff? Wondering if thiswould benefit all users that want contig? Like if we hypotheticallykept clear and dirty separate, like we do now, but with an improvedforce_merge, and then have extent tracking for all contig blobs andreplace the try_harder stuff? When you do a contig alloc, theindividual clear/dirty is still all there within the range, so youcan skip re-clearing in some cases. I guess downside is overall morefuzzy contig + clear/free path, but I guess you would never getallocation failures, when there is sufficient contig space?
Yes, extending extent tracking to all contig allocations has merit,but the core problem remains - with the dual-tree design, we stillneed force_merge to undo the clear/dirty split before those extentscan form. In cases like heavy small-allocation workloads (thousandsof 4 KiB buffers) running first, the memory ends up massivelyfragmented across both trees. When a very large contiguous allocation(e.g., ~64 GiB) comes in later, the allocator fails with OOM eventhough sufficient total free memory exists, because the extenttracker can't find a contiguous range that was never allowed to mergein the first place. I think the dirty/clear split is fundamentallythe problem - allowing the buddy allocator to merge unconditionallyremoves this barrier, and the clear tracker can then be layered ontop as an optimization without blocking coalescing.
Solution

Replace the dual-tree design with:
- A single free_tree[order] rbtree for dirty and mixed free blocks
   (fully cleared free blocks float outside this tree)
- A lightweight out-of-band clear tracker (gpu_clear_tracker)

Fully cleared free blocks are tracked outside the buddy trees using an
augmented interval rbtree, enabling O(log E) lookup of the largest
cleared extents.
Buddy coalescing is now unconditional in __gpu_buddy_free(),regardlessof clear/dirty state. This removes the merge barrier and eliminatesthe
need for __force_merge().

Benefits

- Correct high-order allocations after mixed clear/dirty workloads
- Elimination of O(N x max_order) merge cost from the allocation path
- O(log E) cleared-extent lookup replacing O(N) scans
- Predictable allocation latency under fragmentation
- Reduced complexity with a single tree per order
Since there is no separate tracking for dirty stuff, is the non-cleared alloc path a bit more "fuzzy" now, with it potentiallystealing cleared memory, or is it the same behaviour still?
Right, on v4, the dirty and mixed (partially cleared) blocks areallocated for the non-cleared alloc path, which can end up stealingcleared memory. On v5, I plan to address this with a three-tier dirtyallocation fallback: dirty → mixed → clear, driven by rbtree augmentbits (subtree_has_dirty, subtree_has_mixed), each pass O(log N). Thesplit-descent also applies the same preference at every level whencarving a higher-order block, so cleared memory is preserved as muchas possible and only used as a last resort.
Thoughts ?
No objections from me. Do you want me to still look at v4 in depth, orwait for v5? I only really looked at this from high level.

I will send the v5. Please review the next version.

Thanks,
Arun.

For drivers that don't use free tracking, is there some benefit? Arethere any downsides there? I assume that clear tracker is always empty.
Correct, for drivers that don't clear memory, the clear tracker isalways empty and they simply allocate from the free_tree[]. Benefits:
Single tree per order instead of dual trees (fewer rbtree operations)
No force_merge path at all (unconditional coalescing at free time)
Simpler code path overall
No real downsides - the clear tracker adds zero overhead when empty,and the augment bits would simply show all blocks as dirty, so thewalk degenerates to a normal rbtree lookup with no extra cost.
Regards,
Arun.

Re: [PATCH v4 1/2] gpu/buddy: replace dual-tree/force_merge with decoupled clear tracker

Reply via email to