andygrove opened a new pull request, #1904:
URL: https://github.com/apache/datafusion-ballista/pull/1904

   # Which issue does this PR close?
   
   Closes #1679.
   
   # Rationale for this change
   
   The static `DefaultDistributedPlanner` can broadcast a small build side
   (`HashJoinExec(Partitioned)` → `CollectLeft`, lowered as a broadcast 
shuffle),
   but only for `HashJoinExec`. Ballista defaults `prefer_hash_join = false`, so
   TPC-H joins are planned as `SortMergeJoinExec` and broadcast never fires on 
the
   default (non-AQE) path: small dimension tables get a full hash shuffle and 
force
   the large side to reshuffle on the join key.
   
   A `SortMergeJoinExec` has no `CollectLeft` mode, so when a side is small 
enough
   this PR converts it to a `HashJoinExec(CollectLeft)` (the build side fits in
   memory by definition, so the no-spill reason Ballista prefers sort-merge 
does not
   apply) and reuses the existing broadcast lowering. This mirrors what the AQE
   resolver already does for sort-merge joins.
   
   # What changes are included in this PR?
   
   - New config `ballista.optimizer.broadcast_sort_merge_join_enabled`
     (default `false`).
   - `maybe_promote_to_broadcast` converts a small-side `SortMergeJoinExec` to a
     `HashJoinExec(Partitioned)` (dropping the now-redundant input `SortExec`) 
and
     runs it through the existing threshold/swap/promotion path. Handles the
     `ProjectionExec` that `swap_inputs` inserts to preserve column order.
   - Tests: config default; SMJ promoted to a broadcast `CollectLeft` with sorts
     stripped; SMJ left unchanged when the flag is off or no side is under the
     threshold.
   
   # Are there any user-facing changes?
   
   Yes — a new opt-in config. With it enabled, a `SortMergeJoinExec` whose build
   side fits under `broadcast_join_threshold_bytes` is executed as a broadcast
   (`CollectLeft`) hash join instead of two hash-partitioned shuffles. No SQL
   semantics change.
   
   TPC-H SF10 (AQE off, 8 partitions): ~16% faster across the 22-query suite
   (q8 1.6×, q11 1.5×, q2 1.4×, q17 1.4×, q9 1.3×), with identical row counts.
   
   A follow-up can also strip the join-key `RepartitionExec` from the converted
   inputs (build broadcast from natural partitions, probe kept in place) for a
   further gain; the existing hash-join broadcast path shares that limitation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to