andygrove opened a new pull request, #1651:
URL: https://github.com/apache/datafusion-ballista/pull/1651

   ## Which issue does this PR close?
   
   Closes #1648.
   
   ## Rationale for this change
   
   DataFusion's hash join has no spill support, so each task's build side must 
fit in memory. Ballista executors run multiple tasks per host, so the per-task 
build sides add up and can OOM the executor under realistic load. For example, 
the integration test suite (`./dev/integration-tests.sh`) does not complete on 
a typical machine without disabling hash joins.
   
   DataFusion defaults to `datafusion.optimizer.prefer_hash_join = true`, which 
is fine for a single-node engine but a poor fit for a distributed, multi-task 
one. This PR flips the Ballista default to sort-merge join, which spills, and 
leaves users a session-level knob to opt back into hash join when they know the 
build side fits in memory.
   
   ## What changes are included in this PR?
   
   - Set `datafusion.optimizer.prefer_hash_join = false` inside 
`ballista_restricted_configuration` (`ballista/core/src/extension.rs`), 
alongside the other soft DataFusion-default overrides. This is a soft default — 
users can override per session.
   - Update the existing `should_support_sort_merge_join` integration test to 
assert the default picks `SortMergeJoinExec` via EXPLAIN, instead of relying on 
a result-row check that would pass under either join algorithm.
   - Add `should_support_hash_join_when_opted_in` to verify users can still get 
`HashJoinExec` after `SET datafusion.optimizer.prefer_hash_join = true`.
   - Document the new default and the opt-in path in 
`docs/source/user-guide/tuning-guide.md` under a new "Join Strategy" subsection.
   
   ## Are there any user-facing changes?
   
   Yes, this is a behaviour change. Queries that previously planned a 
`HashJoinExec` will now plan a `SortMergeJoinExec` by default. The old 
behaviour is one `SET` away and is documented in the tuning guide. There is no 
public Rust API change, so no `api change` label.
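
   As a rough sketch of the opt-in path described above, a session could switch back to hash join and check the resulting plan as follows. The `orders` and `customers` tables and their columns are hypothetical and used only for illustration; the config key and the operator names are the ones named in this PR.

   ```sql
   -- Opt back into hash join for this session only
   -- (appropriate when the build side is known to fit in memory).
   SET datafusion.optimizer.prefer_hash_join = true;

   -- Hypothetical tables: with the setting above, the physical plan is
   -- expected to contain HashJoinExec rather than SortMergeJoinExec.
   EXPLAIN
   SELECT o.id, c.name
   FROM orders o
   JOIN customers c ON o.customer_id = c.id;

   -- Return to the new Ballista default (sort-merge join, which can spill).
   SET datafusion.optimizer.prefer_hash_join = false;
   ```

   Running the same `EXPLAIN` before and after the `SET` is a quick way to confirm which join operator the planner picked for a given query.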


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

