andygrove opened a new issue, #4576: URL: https://github.com/apache/datafusion-comet/issues/4576
## What is the problem the feature request solves? Comet's native memory accounting relies on DataFusion's `MemoryPool`, which is *voluntary*: it only counts bytes that an operator explicitly reserves via `reserve()` / `try_grow()`. Allocations that bypass the pool (Arrow buffers built during predicate evaluation, intermediate scratch in joins, allocations inside expression kernels) are invisible to the accounting. Comet's `CometUnifiedMemoryPool` forwards these reservations across JNI to Spark's `TaskMemoryManager`, so the undercount propagates up to Spark as well. This is a known, documented limitation. From the tuning guide: > Comet's memory accounting isn't 100% accurate and this can result in Comet using more memory than it reserves, leading to out-of-memory exceptions. To work around this issue, it is possible to set `spark.comet.exec.memoryPool.fraction` to a value less than `1.0`... `spark.comet.exec.memoryPool.fraction` is effectively a manual fudge factor that compensates for an unknown undercount margin, tuned per workload. We would like to move away from this manual and unreliable accounting toward a real, measured pool. The Comet roadmap already calls this out: > Improving memory accounting, reservation strategies, and spill integration will reduce out-of-memory errors and allow Comet to make better use of available resources, especially in multi-query and multi-task environments. ## Describe the potential solution DataFusion PR [apache/datafusion#22626](https://github.com/apache/datafusion/pull/22626) introduces an allocator-level accounting framework: a global `AccountingAllocator` that measures *actual* process heap allocations independent of voluntary pool reservations, plus an `AccountingMemoryPool` wrapper that compares real heap usage against tracked reservations. (The PR pairs the pool with a ~6x headroom factor, indicating the voluntary accounting undercounts substantially on the SLT suite.) Investigate whether and how Comet can adopt this: 1. **Observability first.** Wrap `CometUnifiedMemoryPool` in an `AccountingMemoryPool` behind `spark.comet.debug.memory` to quantify the real divergence between reserved and actually-allocated bytes per workload. This directly tests whether the measured number could replace the `memoryPool.fraction` heuristic. 2. **Honest reporting to Spark.** Evaluate reporting *measured* native heap (not just reservations) up to Spark's `TaskMemoryManager` so the unified memory manager has a true accounting of Comet's footprint. 3. **Leak / untracked-allocation detection.** Use the allocator to flag Comet's custom operators and expression kernels that allocate without reserving. 4. **Enforcement.** Determine how Comet should react to overdraft. Unlike the PR's `kill_on_overdraft` panic mechanism, Comet wants to spill / apply back-pressure rather than crash. Note `NativeMemoryConsumer.spill()` currently returns 0, so native-spill-driven-from-JVM is a prerequisite that must close in parallel. ## Additional context Open questions to resolve during investigation: - **Global allocator in a shared JVM.** A Spark executor runs many tasks concurrently in one process; the JVM heap and shaded Arrow-Java live in the same address space. A global Rust accounting allocator sees all Rust allocations (but nothing JVM-side) and aggregates across tasks. Comet would need to thread its task-attempt id into the allocator's thread-local context and reconcile this with `TASK_SHARED_MEMORY_POOLS` sharing. The PR stamps worker threads with a context id via `on_thread_start`, which maps reasonably onto Spark's per-task model. - **API stability.** PR #22626 predates significant churn in DataFusion's memory subsystem. The exact API surface (`AccountingAllocator`, `AccountingMemoryPool`, `MemoryPool::try_resize`, `kill_on_overdraft`) should be verified against current merged DataFusion source before building on it. - The ~6x headroom factor in the PR means adopting the allocator first makes the inaccuracy *visible and large*; the payoff comes from incrementally routing real allocations through accounting to drive that factor down. Related: depends on / overlaps with native spill integration work (`NativeMemoryConsumer.spill()`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
