andygrove opened a new pull request, #1604:
URL: https://github.com/apache/datafusion-ballista/pull/1604

   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
   
   The sort-based shuffle writer (added recently and documented in #1595) 
writes `2 x N` consolidated files with an index instead of the hash-based 
writer's `N x M` files, coalesces small batches, and bounds shuffle memory via 
spill. For most workloads, especially those with high partition fan-out, this 
is the better default. Now that it has integration test coverage and 
documentation in the tuning guide, it makes sense to make it the default for 
new sessions while keeping the hash-based writer reachable via configuration.
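
To make the file-count difference concrete, here is a minimal arithmetic sketch. It assumes `N` and `M` are the partition counts on the two sides of the shuffle (the description above does not define them), and the function names are hypothetical, not part of Ballista.

```rust
/// Files written by a hash-based writer under the `N x M` formula above.
fn hash_shuffle_files(n: u64, m: u64) -> u64 {
    n * m
}

/// Files written by the sort-based writer under the `2 x N` formula above.
fn sort_shuffle_files(n: u64) -> u64 {
    2 * n
}

fn main() {
    // Assumed example partition counts, chosen only for illustration.
    for &(n, m) in [(8u64, 2u64), (16, 16), (200, 200)].iter() {
        println!(
            "N={} M={} hash={} sort={}",
            n,
            m,
            hash_shuffle_files(n, m),
            sort_shuffle_files(n)
        );
    }
}
```

Under these formulas the two writers produce the same number of files at `M = 2`, and the sort-based count stays fixed at `2 x N` as `M` grows, which is why the gap widens with high fan-out.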
   
   # What changes are included in this PR?
   
   - Flip the default of `ballista.shuffle.sort_based.enabled` from `false` to 
`true` in `ballista/core/src/config.rs`.
   - Update `docs/source/user-guide/configs.md` to show the new default and a 
note about falling back to the hash writer.
   - Restructure the "Shuffle Implementation" section of 
`docs/source/user-guide/tuning-guide.md` so sort-based is presented as the 
default and hash-based as the opt-in fallback.
   
   # Are there any user-facing changes?
   
   Yes. New sessions will use the sort-based shuffle writer by default. Users 
who want the previous behavior can set 
`ballista.shuffle.sort_based.enabled=false` per session.
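
The user-facing behavior can be modeled as a default that a per-session override replaces. The sketch below is self-contained and uses a hypothetical `SessionOptions` type as a stand-in; it is not Ballista's or DataFusion's configuration API, and only the key name comes from this PR.

```rust
use std::collections::HashMap;

/// Config key named in this PR.
const SORT_BASED_KEY: &str = "ballista.shuffle.sort_based.enabled";

/// Hypothetical stand-in for a session's options (not the real API).
struct SessionOptions {
    overrides: HashMap<String, String>,
}

impl SessionOptions {
    fn new() -> Self {
        Self { overrides: HashMap::new() }
    }

    /// Record a per-session override for a key.
    fn set(mut self, key: &str, value: &str) -> Self {
        self.overrides.insert(key.to_string(), value.to_string());
        self
    }

    /// Resolve the flag: explicit override if present, else the new default (`true`).
    fn sort_based_enabled(&self) -> bool {
        self.overrides
            .get(SORT_BASED_KEY)
            .map(|v| v.as_str() == "true")
            .unwrap_or(true)
    }
}

fn main() {
    let default_session = SessionOptions::new();
    let hash_session = SessionOptions::new().set(SORT_BASED_KEY, "false");
    println!("default session uses sort-based: {}", default_session.sort_based_enabled());
    println!("opt-out session uses sort-based: {}", hash_session.sort_based_enabled());
}
```

In a real deployment the override would go through whatever session configuration mechanism the client already uses; the model only illustrates that sessions without an explicit setting now resolve to the sort-based writer.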


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

