Re: [I] NestedLoopJoinExec spill path: untracked allocation overshoots memory pool [datafusion]

via GitHub Wed, 03 Jun 2026 19:29:57 -0700


2010YOUY01 commented on issue #22723:
URL: https://github.com/apache/datafusion/issues/22723#issuecomment-4618433648


   > [@2010YOUY01](https://github.com/2010YOUY01) , re: issue 1 - PTAL at the 
actual [issue for the DataFusion 51 
regression](https://github.com/apache/datafusion/issues/22739) that motivated 
this whole path. I should have lead with that, and I think it answers a few of 
your questions. Instead of doing that, I just cranked `HEADROOM` down to `5` 
and filed a bug for the first thing it hit (NLJ).
   > 
   > Addressing your points (as I perceive them) not addressed by the above:
   > 
   > 1. This conflates integration tests with memory accounting tests. i.e. the 
operator is at fault, not a random SLT. I would argue that most of our coverage 
is through SLTs so adding it here buys regression testing "for free". Normally 
that might be bad, because when an integration test breaks it's hard to debug, 
but this tracker usually points to the exact call stack of the bug. I think 
accounting regressions are real regressions and thus belong in the regression 
suite.
   > 2. "small memory-limited queries" - I think the question is twofold: 1. 
what is "small"? and 2. how does it scale? 700K in a test might be "small", but 
if that multiplies out in a prod environment to be 7GB, then it's big. A 
possible solution: we can define both the existing relative `HEADROOM` and an 
`ABS_HEADROOM` and it doesn't fire unless it crosses both. I would lightly push 
back though because that means we specifically need to run all these with large 
amounts of data, which means we don't get the benefit of the existing suite.
   > 
   > I think your other concerns (difficulty of tracking, size required to 
track, etc) are all addressed in the other issue. I hope I was able to give a 
proper response after I put in proper effort.
   
   Thanks for raising these points. I agree this is an important issue to 
solve, but I think the proposed approach may be difficult to maintain with the 
current memory accounting model. Here are a few thoughts on: 
   1. why it does not work reliably today
   2. what we might need before it can work well
   
   - The current operator-level memory accounting is intentionally incomplete. 
It mostly tracks buffers that can grow without a fixed bound, such as 
aggregation hash tables. It generally treats one in-flight batch per operator 
per partition as constant overhead and leaves it untracked. This can still 
become large when the operator pipeline is long or when the target partition 
count is high. Your validation approach would be valid if all of this memory 
were accounted for, but today's implementation trades completeness for 
simplicity.
   
   - If we do not fix that first, the `HEADROOM` + `ABS_HEADROOM` model is 
wrong. It could produce false positives and make the tests hard to maintain.
     - Given this specific bug in the issue, we have to first understand all 
low level detail in all related operators, then do estimation to figure out if 
it's false positive or not. Below is the extra investigation to do for bug 
report triage on this issue, I believe this is very hard to do.
    ```
   Memory limit is 150K, measured allocation 770K
   - generate_series holds one temporary batch: 8K rows * 8 bytes = 64KB
   - Current probe side batch: 64KB
   - NLJ holds a batch for evaluating the filter (v1, v2, 0), plus a result 
batch: roughly 8K rows * 8 bytes * 4
   - All buffered build side batch: 150KB
   
   --> Corrected memory estimation: 534KB
   ```
   
   - I agree that catching regressions in SLTs is valuable. But given the 
assumptions above, the easiest thing we can do for now may be to compare memory 
usage between commits. That would still let us detect regressions when a PR 
accidentally increases memory allocation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] NestedLoopJoinExec spill path: untracked allocation overshoots memory pool [datafusion]

Reply via email to