avantgardnerio opened a new pull request, #22776: URL: https://github.com/apache/datafusion/pull/22776
## Which issue does this PR close? - Part of #22758 (memory accounting epic). - Addresses developer-ergonomics feedback raised by @alamb in https://github.com/apache/datafusion/issues/22723#issuecomment-4622423197. ## Rationale for this change @alamb's concern in #22723 was that a contributor working on something unrelated could hit `OverdraftPanic` and have no idea what to do — the message rendered as `allocator overdraft: account balance at panic = -1384887 bytes` and the underlying allocation was lost across spawn boundaries. The ask was: (1) self-explanatory error message with a remediation path, (2) per-test opt-in instead of a global headroom knob, (3) a written long-term plan. ## What changes are included in this PR? 1. **Per-test opt-in for overdraft factor.** New SLT-only knob `SET datafusion.sqllogictest.memory_overdraft_factor = N` lifts the bank multiplier for the *next* `SET datafusion.runtime.memory_limit`, then auto-resets. The DataFusion parser doesn't know about this variable — the SLT runner intercepts it. No effect on `datafusion-cli` or anything else. 2. **Chained panic hook (`install_overdraft_panic_hook`).** `OverdraftPanic` frequently fires inside spawned tokio tasks (e.g. RepartitionExec workers); tokio's task harness catches the unwind on the worker and the joiner converts it into `DataFusionError::External("task NN panicked")` — original payload lost. The hook runs on the worker *before* the harness eats it and writes the actionable message + forced backtrace to stderr. The runner's `catch_unwind` arm now reduces to a one-line pointer at the stderr. 3. **`DEFAULT_BUDGET` defaults to `isize::MAX / 2` (effectively infinite).** An un-armed bank is now a no-op; real enforcement only kicks in once `--default-pool-size-mb` or `SET datafusion.runtime.memory_limit` arms the bank. Previously, file-setup allocations settled into a 0-budget bank and tripped before any real work ran, with a misleading backtrace pointing at `RuntimeEnv::default`. 4. **Self-explanatory error text.** The stderr message now reads: > Memory accounting mismatch: this thread allocated N bytes more than DataFusion's MemoryPool accounted for. Some operator is allocating outside the pool's tracking — this is a real accounting bug worth fixing. > > If you made changes to an operator or UDF this is probably related to your work and should be investigated. If you do not believe this is related to your change, the fastest path forward is to opt the failing SLT into a larger overdraft tolerance and file the gap against the epic so we can pay it down: > > `SET datafusion.sqllogictest.memory_overdraft_factor = N;` > > where N is roughly `2 * <bytes the query actually needs> / datafusion.runtime.memory_limit`. The override applies to the next `SET datafusion.runtime.memory_limit` only, then auto-resets. > > Please record the query + observed overshoot at: > https://github.com/apache/datafusion/issues/22758 > > stack backtrace: … 5. **README documents the workflow** — when the message appears, how to opt out, why the default factor is loose (`8.0`), and the long-term plan to drive it toward `1.1` via #22758. ## Are these changes tested? - Unit tests cover `set_memory_overdraft_factor` overriding then resetting, and the SET-statement interceptor parsing bare / quoted / signed / bogus values. - End-to-end verification: tightening `nested_loop_join_spill.slt` to `SET datafusion.sqllogictest.memory_overdraft_factor = 1.0` (with `--default-pool-size-mb 16384`) trips the hook and produces a backtrace pointing at `nested_loop_join.rs:2156` → `arrow_arith::add_wrapping` → `Vec::collect` — a real untracked allocation worth filing as a sub-issue. - Test suite continues to pass with or without `--default-pool-size-mb` (default-pool-mb-absent is now a no-op rather than tripping spuriously during `RuntimeEnv::default`). ## Are there any user-facing changes? Only SLT-runner-facing: - New SLT variable `datafusion.sqllogictest.memory_overdraft_factor`. - New panic hook installed when running with `--features memory-accounting`. - README has a new sub-section on the developer workflow when the check fires. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
