[I] Investigate adopting DataFusion's allocator-level memory accounting to replace manual memory tuning [datafusion-comet]

via GitHub Tue, 02 Jun 2026 15:30:45 -0700


andygrove opened a new issue, #4576:
URL: https://github.com/apache/datafusion-comet/issues/4576


   ## What is the problem the feature request solves?
   
   Comet's native memory accounting relies on DataFusion's `MemoryPool`, which 
is *voluntary*: it only counts bytes that an operator explicitly reserves via 
`reserve()` / `try_grow()`. Allocations that bypass the pool (Arrow buffers 
built during predicate evaluation, intermediate scratch in joins, allocations 
inside expression kernels) are invisible to the accounting. Comet's 
`CometUnifiedMemoryPool` forwards these reservations across JNI to Spark's 
`TaskMemoryManager`, so the undercount propagates up to Spark as well.
   
   This is a known, documented limitation. From the tuning guide:
   
   > Comet's memory accounting isn't 100% accurate and this can result in Comet 
using more memory than it reserves, leading to out-of-memory exceptions. To 
work around this issue, it is possible to set 
`spark.comet.exec.memoryPool.fraction` to a value less than `1.0`...
   
   `spark.comet.exec.memoryPool.fraction` is effectively a manual fudge factor 
that compensates for an unknown undercount margin, tuned per workload. We would 
like to move away from this manual and unreliable accounting toward a real, 
measured pool.
   
   The Comet roadmap already calls this out:
   
   > Improving memory accounting, reservation strategies, and spill integration 
will reduce out-of-memory errors and allow Comet to make better use of 
available resources, especially in multi-query and multi-task environments.
   
   ## Describe the potential solution
   
   DataFusion PR 
[apache/datafusion#22626](https://github.com/apache/datafusion/pull/22626) 
introduces an allocator-level accounting framework: a global 
`AccountingAllocator` that measures *actual* process heap allocations 
independent of voluntary pool reservations, plus an `AccountingMemoryPool` 
wrapper that compares real heap usage against tracked reservations. (The PR 
pairs the pool with a ~6x headroom factor, indicating the voluntary accounting 
undercounts substantially on the SLT suite.)
   
   Investigate whether and how Comet can adopt this:
   
   1. **Observability first.** Wrap `CometUnifiedMemoryPool` in an 
`AccountingMemoryPool` behind `spark.comet.debug.memory` to quantify the real 
divergence between reserved and actually-allocated bytes per workload. This 
directly tests whether the measured number could replace the 
`memoryPool.fraction` heuristic.
   2. **Honest reporting to Spark.** Evaluate reporting *measured* native heap 
(not just reservations) up to Spark's `TaskMemoryManager` so the unified memory 
manager has a true accounting of Comet's footprint.
   3. **Leak / untracked-allocation detection.** Use the allocator to flag 
Comet's custom operators and expression kernels that allocate without reserving.
   4. **Enforcement.** Determine how Comet should react to overdraft. Unlike 
the PR's `kill_on_overdraft` panic mechanism, Comet wants to spill / apply 
back-pressure rather than crash. Note `NativeMemoryConsumer.spill()` currently 
returns 0, so native-spill-driven-from-JVM is a prerequisite that must close in 
parallel.
   
   ## Additional context
   
   Open questions to resolve during investigation:
   
   - **Global allocator in a shared JVM.** A Spark executor runs many tasks 
concurrently in one process; the JVM heap and shaded Arrow-Java live in the 
same address space. A global Rust accounting allocator sees all Rust 
allocations (but nothing JVM-side) and aggregates across tasks. Comet would 
need to thread its task-attempt id into the allocator's thread-local context 
and reconcile this with `TASK_SHARED_MEMORY_POOLS` sharing. The PR stamps 
worker threads with a context id via `on_thread_start`, which maps reasonably 
onto Spark's per-task model.
   - **API stability.** PR #22626 predates significant churn in DataFusion's 
memory subsystem. The exact API surface (`AccountingAllocator`, 
`AccountingMemoryPool`, `MemoryPool::try_resize`, `kill_on_overdraft`) should 
be verified against current merged DataFusion source before building on it.
   - The ~6x headroom factor in the PR means adopting the allocator first makes 
the inaccuracy *visible and large*; the payoff comes from incrementally routing 
real allocations through accounting to drive that factor down.
   
   Related: depends on / overlaps with native spill integration work 
(`NativeMemoryConsumer.spill()`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Investigate adopting DataFusion's allocator-level memory accounting to replace manual memory tuning [datafusion-comet]

Reply via email to