andygrove opened a new pull request, #1651: URL: https://github.com/apache/datafusion-ballista/pull/1651
## Which issue does this PR close?

Closes #1648.

## Rationale for this change

DataFusion's hash join has no spill support, so the build side must fit in memory per task. Ballista executors run multiple tasks per host, so per-task build sides aggregate and OOM the executor under realistic load. For example, the integration test suite (`./dev/integration-tests.sh`) does not complete on a typical machine without disabling hash joins.

DataFusion sets `datafusion.optimizer.prefer_hash_join = true`, which is fine for a single-node engine but a poor fit for a distributed multi-task one. This PR flips the Ballista default to sort-merge join, which spills, and leaves users a session-level knob to opt back into hash join when they know the build side fits.

## What changes are included in this PR?

- Set `datafusion.optimizer.prefer_hash_join = false` inside `ballista_restricted_configuration` (`ballista/core/src/extension.rs`), alongside the other soft DataFusion-default overrides. This is a soft default: users can override it per session.
- Update the existing `should_support_sort_merge_join` integration test to assert via EXPLAIN that the default picks `SortMergeJoinExec`, instead of relying on a result-row check that would pass under either join algorithm.
- Add `should_support_hash_join_when_opted_in` to verify users can still get `HashJoinExec` after `SET datafusion.optimizer.prefer_hash_join = true`.
- Document the new default and the opt-in path in `docs/source/user-guide/tuning-guide.md` under a new "Join Strategy" subsection.

## Are there any user-facing changes?

Yes, this is a behaviour change. Queries that previously planned a `HashJoinExec` will now plan a `SortMergeJoinExec` by default. The old behaviour is one `SET` away and is documented in the tuning guide. There is no public Rust API change, so no `api change` label.
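
To make the "soft default" semantics concrete, here is a toy model in plain Rust. It is only a sketch and not Ballista's or DataFusion's actual code: `SessionSettings`, `with_ballista_defaults`, `set`, and `planned_join` are hypothetical names invented for illustration, and the real planner weighs more than this one flag when choosing a join operator. It shows only the behaviour described above: the default resolves to sort-merge join, and a session-level `SET` flips it back.

```rust
use std::collections::HashMap;

const PREFER_HASH_JOIN: &str = "datafusion.optimizer.prefer_hash_join";

/// Hypothetical stand-in for a session's configuration; not a Ballista type.
struct SessionSettings {
    values: HashMap<String, String>,
}

impl SessionSettings {
    /// Start from the soft default described in this PR (hash join not preferred).
    fn with_ballista_defaults() -> Self {
        let mut values = HashMap::new();
        values.insert(PREFER_HASH_JOIN.to_string(), "false".to_string());
        Self { values }
    }

    /// Models `SET key = value`: a session-level value replaces the default.
    fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    /// Join operator this toy model picks, based only on the one flag.
    fn planned_join(&self) -> &'static str {
        match self.values.get(PREFER_HASH_JOIN).map(String::as_str) {
            Some("true") => "HashJoinExec",
            _ => "SortMergeJoinExec",
        }
    }
}

fn main() {
    let mut session = SessionSettings::with_ballista_defaults();
    println!("default:   {}", session.planned_join());

    // Equivalent in spirit to: SET datafusion.optimizer.prefer_hash_join = true
    session.set(PREFER_HASH_JOIN, "true");
    println!("after SET: {}", session.planned_join());
}
```

The override is scoped to the one session object, which mirrors the intent that opting back into hash join is a per-session decision made when the build side is known to fit in memory.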
