andygrove opened a new pull request, #1904:
URL: https://github.com/apache/datafusion-ballista/pull/1904
# Which issue does this PR close?
Closes #1679.
# Rationale for this change
The static `DefaultDistributedPlanner` can broadcast a small build side
(`HashJoinExec(Partitioned)` → `CollectLeft`, lowered as a broadcast shuffle),
but only for `HashJoinExec`. Ballista defaults `prefer_hash_join = false`, so
TPC-H joins are planned as `SortMergeJoinExec` and broadcast never fires on the
default (non-AQE) path: small dimension tables get a full hash shuffle and force
the large side to reshuffle on the join key.
A `SortMergeJoinExec` has no `CollectLeft` mode, so when a side is small enough
this PR converts it to a `HashJoinExec(CollectLeft)` (the build side fits in
memory by definition, so the no-spill reason Ballista prefers sort-merge does
not apply) and reuses the existing broadcast lowering. This mirrors what the AQE
resolver already does for sort-merge joins.
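The size check behind "small enough" can be sketched as follows. This is a minimal illustration, not the code in this PR: `BroadcastSide` and `choose_broadcast_side` are hypothetical names, `None` stands in for missing statistics, and the inclusive comparison against the threshold is an assumption. The real planner also has to account for join type when deciding whether a swap is legal, which this sketch ignores.

```rust
/// Which join input would be broadcast (i.e. used as the in-memory build side).
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum BroadcastSide {
    Left,
    Right,
}

/// Hypothetical helper: pick the side to broadcast, if any.
/// A size of `None` models an input with no byte-size statistics.
/// Assumption: a side qualifies when its size is <= the threshold.
pub fn choose_broadcast_side(
    left_bytes: Option<u64>,
    right_bytes: Option<u64>,
    threshold_bytes: u64,
) -> Option<BroadcastSide> {
    // Keep a size only if it is known and fits under the threshold.
    let fits = |size: Option<u64>| size.filter(|&b| b <= threshold_bytes);
    match (fits(left_bytes), fits(right_bytes)) {
        // Both qualify: broadcast the smaller one (ties go to the left).
        (Some(l), Some(r)) => Some(if l <= r {
            BroadcastSide::Left
        } else {
            BroadcastSide::Right
        }),
        (Some(_), None) => Some(BroadcastSide::Left),
        (None, Some(_)) => Some(BroadcastSide::Right),
        // Neither side qualifies: keep the sort-merge join as planned.
        (None, None) => None,
    }
}
```

When the function returns `None`, the join would stay a `SortMergeJoinExec` with both inputs hash-shuffled, as on the default path today.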
# What changes are included in this PR?
- New config `ballista.optimizer.broadcast_sort_merge_join_enabled`
  (default `false`).
- `maybe_promote_to_broadcast` converts a small-side `SortMergeJoinExec` to a
  `HashJoinExec(Partitioned)` (dropping the now-redundant input `SortExec`) and
  runs it through the existing threshold/swap/promotion path. Handles the
  `ProjectionExec` that `swap_inputs` inserts to preserve column order.
- Tests: config default; SMJ promoted to a broadcast `CollectLeft` with sorts
  stripped; SMJ left unchanged when the flag is off or no side is under the
  threshold.
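The rewrite described in the list above can be modelled on a toy plan tree. This is a sketch under stated assumptions, not the DataFusion or Ballista API: `Plan`, `JoinMode`, `ReorderColumns`, `strip_sort`, and `promote` are all hypothetical stand-ins, byte sizes are carried directly on scans, and the two real steps (convert to `Partitioned`, then promote to `CollectLeft`) are collapsed into one.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum JoinMode {
    Partitioned,
    CollectLeft,
}

/// Toy physical plan; every variant is a hypothetical stand-in.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String, bytes: u64 },
    Sort(Box<Plan>),
    SortMergeJoin { left: Box<Plan>, right: Box<Plan> },
    HashJoin { left: Box<Plan>, right: Box<Plan>, mode: JoinMode },
    /// Stand-in for the projection that restores column order after a swap.
    ReorderColumns(Box<Plan>),
}

impl Plan {
    /// Crude size estimate; summing join inputs is an assumption of this sketch.
    fn bytes(&self) -> u64 {
        match self {
            Plan::Scan { bytes, .. } => *bytes,
            Plan::Sort(p) | Plan::ReorderColumns(p) => p.bytes(),
            Plan::SortMergeJoin { left, right }
            | Plan::HashJoin { left, right, .. } => left.bytes() + right.bytes(),
        }
    }
}

/// Drop a top-level sort: a hash join does not need sorted inputs.
fn strip_sort(plan: Plan) -> Plan {
    match plan {
        Plan::Sort(inner) => *inner,
        other => other,
    }
}

/// Promote a sort-merge join to a broadcast hash join when the flag is on
/// and at least one input is within `threshold` bytes (inclusive, by assumption).
pub fn promote(plan: Plan, enabled: bool, threshold: u64) -> Plan {
    let (left, right) = match plan {
        Plan::SortMergeJoin { left, right } if enabled => (left, right),
        other => return other, // flag off, or not a sort-merge join
    };
    let (l, r) = (left.bytes(), right.bytes());
    if l.min(r) > threshold {
        // No side is small enough: leave the plan untouched.
        return Plan::SortMergeJoin { left, right };
    }
    let (left, right) = (strip_sort(*left), strip_sort(*right));
    if l <= r {
        Plan::HashJoin {
            left: Box::new(left),
            right: Box::new(right),
            mode: JoinMode::CollectLeft,
        }
    } else {
        // Swap so the small side becomes the build (left) side, then
        // wrap the join to restore the original column order.
        Plan::ReorderColumns(Box::new(Plan::HashJoin {
            left: Box::new(right),
            right: Box::new(left),
            mode: JoinMode::CollectLeft,
        }))
    }
}
```

The three outcomes of `promote` correspond to the test cases listed above: promotion with sorts stripped, no change when the flag is off, and no change when neither side is under the threshold.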
# Are there any user-facing changes?
Yes — a new opt-in config. With it enabled, a `SortMergeJoinExec` whose build
side fits under `broadcast_join_threshold_bytes` is executed as a broadcast
(`CollectLeft`) hash join instead of two hash-partitioned shuffles. No SQL
semantics change.
TPC-H SF10 (AQE off, 8 partitions): ~16% faster across the 22-query suite
(q8 1.6×, q11 1.5×, q2 1.4×, q17 1.4×, q9 1.3×), with identical row counts.
A follow-up can also strip the join-key `RepartitionExec` from the converted
inputs (build broadcast from natural partitions, probe kept in place) for a
further gain; the existing hash-join broadcast path shares that limitation.