andygrove opened a new pull request, #4186:
URL: https://github.com/apache/datafusion-comet/pull/4186
## Which issue does this PR close?
Closes #.
## Rationale for this change
Forcing every `SortMergeJoinExec` to be rewritten as `ShuffledHashJoinExec`
(via `spark.comet.exec.replaceSortMergeJoin=true`) can OOM on large joins
because Comet's native `HashJoinExec` cannot spill its hash table. The rule
previously had no size-based safety net, so enabling it on queries with
multi-GB build sides (e.g. TPC-H q9's `lineitem` joins) aborted the stage.
## What changes are included in this PR?
- `RewriteJoin` consults a per-join build-side size budget before replacing
a `SortMergeJoinExec` with a `ShuffledHashJoinExec`. Joins whose build-side
`stats.sizeInBytes` exceeds the budget are kept as sort-merge joins (SMJ).
- The budget is either explicit (`maxBuildSize`) or derived from Spark conf:
`offHeap.size / executor.cores * memoryFraction / hashTableOverhead`.
- Three new configs under `spark.comet.exec.replaceSortMergeJoin.*`:
`maxBuildSize` (absolute cap, `0` = auto-derive, `-1` = disable),
`memoryFraction` (default `0.25`), and `hashTableOverhead` (default `3.0`).
- Rule rejections emit a `withInfo` message naming the sizes and configs so
users can see why a join was not rewritten.
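
The budget rule described above can be sketched in a few lines. This is a minimal illustration of the arithmetic only, not the actual `RewriteJoin` code (which lives in Comet's Scala sources). The function and parameter names are hypothetical. Treating every negative `maxBuildSize` as "disabled" and guarding against a zero core count are assumptions made for the sketch.

```python
from typing import Optional

GIB = 1024 ** 3


def build_size_budget(max_build_size: int,
                      off_heap_bytes: int,
                      executor_cores: int,
                      memory_fraction: float = 0.25,
                      hash_table_overhead: float = 3.0) -> Optional[int]:
    """Return the build-side budget in bytes, or None when the check is disabled.

    Hypothetical helper mirroring the three modes of `maxBuildSize`:
    negative disables the check, positive is an explicit cap, 0 auto-derives.
    """
    if max_build_size < 0:
        # -1 in the PR description: no size check at all (assumed for any negative value).
        return None
    if max_build_size > 0:
        # Explicit absolute cap wins over the derived value.
        return max_build_size
    # Auto-derive: offHeap.size / executor.cores * memoryFraction / hashTableOverhead
    per_task_bytes = off_heap_bytes / max(executor_cores, 1)  # zero-core guard is an assumption
    return int(per_task_bytes * memory_fraction / hash_table_overhead)


def should_rewrite(build_side_bytes: int, budget: Optional[int]) -> bool:
    """A join is rewritten to a hash join unless its build side exceeds the budget."""
    return budget is None or build_side_bytes <= budget


# Example: 16 GiB off-heap, 8 cores, default fraction and overhead.
budget = build_size_budget(0, 16 * GIB, 8)
print(budget)                           # 178956970 bytes, roughly 171 MiB
print(should_rewrite(5 * GIB, budget))  # False: a 5 GiB build side stays SMJ
```

With the defaults, each task slot gets a quarter of its share of off-heap memory, divided by three to allow for hash-table overhead, so multi-GB build sides are left as sort-merge joins unless `maxBuildSize` is raised or set to `-1`.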
## How are these changes tested?
- New `RewriteJoinSuite` covers the budget computation: explicit-value,
auto-derive, disable-check, and scaling with `memoryFraction` /
`hashTableOverhead`.
- End-to-end rewrite correctness is already covered by `CometJoinSuite` and
`CometExecSuite`. Both were rerun locally with `replaceSortMergeJoin=true`
and the new defaults, and all tests passed.
- `RewriteJoinSuite` is added to the `exec` suite lists in the Linux and macOS
PR build workflows.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]