andygrove opened a new pull request, #1604: URL: https://github.com/apache/datafusion-ballista/pull/1604
# Which issue does this PR close?

Closes #.

# Rationale for this change

The sort-based shuffle writer (added recently and documented in #1595) writes `2 x N` consolidated files with an index instead of the hash-based writer's `N x M` files, coalesces small batches, and bounds shuffle memory via spill. For most workloads, especially those with high partition fan-out, this is the better default. Now that it has integration test coverage and documentation in the tuning guide, it makes sense to make it the default for new sessions while keeping the hash-based writer reachable via configuration.

# What changes are included in this PR?

- Flip the default of `ballista.shuffle.sort_based.enabled` from `false` to `true` in `ballista/core/src/config.rs`.
- Update `docs/source/user-guide/configs.md` to show the new default and a note about falling back to the hash writer.
- Restructure the "Shuffle Implementation" section of `docs/source/user-guide/tuning-guide.md` so sort-based is presented as the default and hash-based as the opt-in fallback.

# Are there any user-facing changes?

Yes. New sessions will use the sort-based shuffle writer by default. Users who want the previous behavior can set `ballista.shuffle.sort_based.enabled=false` per session.
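To make the file-count argument in the rationale concrete, here is a minimal sketch of the arithmetic. It is not Ballista code: the function names are hypothetical, and it assumes `N` is the number of input (map-side) partitions, `M` is the number of output partitions, and the `2 x N` figure means one data file plus one index file per input partition.

```rust
/// Hypothetical helper, not part of Ballista.
/// Hash-based writer: assumed to write one file per
/// (input partition, output partition) pair, i.e. N x M files.
fn hash_based_file_count(n: u64, m: u64) -> u64 {
    n * m
}

/// Hypothetical helper, not part of Ballista.
/// Sort-based writer: assumed to write one consolidated data file
/// plus one index file per input partition, i.e. 2 x N files,
/// regardless of how many output partitions there are.
fn sort_based_file_count(n: u64) -> u64 {
    2 * n
}
```

Under these assumptions, a stage with 100 input partitions and 200 output partitions would produce 20,000 files with the hash-based writer and 200 with the sort-based writer. The sort-based count is lower whenever `M > 2`, which fits the claim that the benefit is largest at high partition fan-out.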
