avantgardnerio opened a new pull request, #22776:
URL: https://github.com/apache/datafusion/pull/22776

   ## Which issue does this PR close?
   
   - Part of #22758 (memory accounting epic).
   - Addresses developer-ergonomics feedback raised by @alamb in 
https://github.com/apache/datafusion/issues/22723#issuecomment-4622423197.
   
   ## Rationale for this change
   
   @alamb's concern in #22723 was that a contributor working on something 
unrelated could hit `OverdraftPanic` and have no idea what to do — the message 
rendered as `allocator overdraft: account balance at panic = -1384887 bytes` 
and the underlying allocation was lost across spawn boundaries. The ask was: 
(1) self-explanatory error message with a remediation path, (2) per-test opt-in 
instead of a global headroom knob, (3) a written long-term plan.
   
   ## What changes are included in this PR?
   
   1. **Per-test opt-in for overdraft factor.** New SLT-only knob `SET 
datafusion.sqllogictest.memory_overdraft_factor = N` lifts the bank multiplier 
for the *next* `SET datafusion.runtime.memory_limit`, then auto-resets. The 
DataFusion parser doesn't know about this variable — the SLT runner intercepts 
it. No effect on `datafusion-cli` or anything else.
   
   2. **Chained panic hook (`install_overdraft_panic_hook`).** `OverdraftPanic` 
frequently fires inside spawned tokio tasks (e.g. RepartitionExec workers); 
tokio's task harness catches the unwind on the worker and the joiner converts 
it into `DataFusionError::External("task NN panicked")` — original payload 
lost. The hook runs on the worker *before* the harness eats it and writes the 
actionable message + forced backtrace to stderr. The runner's `catch_unwind` 
arm now reduces to a one-line pointer at the stderr.
   
   3. **`DEFAULT_BUDGET` defaults to `isize::MAX / 2` (effectively infinite).** 
An un-armed bank is now a no-op; real enforcement only kicks in once 
`--default-pool-size-mb` or `SET datafusion.runtime.memory_limit` arms the 
bank. Previously, file-setup allocations settled into a 0-budget bank and 
tripped before any real work ran, with a misleading backtrace pointing at 
`RuntimeEnv::default`.
   
   4. **Self-explanatory error text.** The stderr message now reads:
      > Memory accounting mismatch: this thread allocated N bytes more than 
DataFusion's MemoryPool accounted for. Some operator is allocating outside the 
pool's tracking — this is a real accounting bug worth fixing.
      >
      > If you made changes to an operator or UDF this is probably related to 
your work and should be investigated. If you do not believe this is related to 
your change, the fastest path forward is to opt the failing SLT into a larger 
overdraft tolerance and file the gap against the epic so we can pay it down:
      >
      >   `SET datafusion.sqllogictest.memory_overdraft_factor = N;`
      >
      > where N is roughly `2 * <bytes the query actually needs> / 
datafusion.runtime.memory_limit`. The override applies to the next `SET 
datafusion.runtime.memory_limit` only, then auto-resets.
      >
      > Please record the query + observed overshoot at:
      >   https://github.com/apache/datafusion/issues/22758
      >
      > stack backtrace: …
   
   5. **README documents the workflow** — when the message appears, how to opt 
out, why the default factor is loose (`8.0`), and the long-term plan to drive 
it toward `1.1` via #22758.
   
   ## Are these changes tested?
   
   - Unit tests cover `set_memory_overdraft_factor` overriding then resetting, 
and the SET-statement interceptor parsing bare / quoted / signed / bogus values.
   - End-to-end verification: tightening `nested_loop_join_spill.slt` to `SET 
datafusion.sqllogictest.memory_overdraft_factor = 1.0` (with 
`--default-pool-size-mb 16384`) trips the hook and produces a backtrace 
pointing at `nested_loop_join.rs:2156` → `arrow_arith::add_wrapping` → 
`Vec::collect` — a real untracked allocation worth filing as a sub-issue.
   - Test suite continues to pass with or without `--default-pool-size-mb` 
(default-pool-mb-absent is now a no-op rather than tripping spuriously during 
`RuntimeEnv::default`).
   
   ## Are there any user-facing changes?
   
   Only SLT-runner-facing:
   - New SLT variable `datafusion.sqllogictest.memory_overdraft_factor`.
   - New panic hook installed when running with `--features memory-accounting`.
   - README has a new sub-section on the developer workflow when the check 
fires.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to