mbutrovich commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081924897
##########
datafusion/common/src/config.rs:
##########
@@ -555,7 +555,17 @@ config_namespace! {
/// When sorting, below what size should data be concatenated
/// and sorted in a single RecordBatch rather than sorted in
/// batches and merged.
- pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+ ///
+ /// Deprecated: this option is no longer used. The sort pipeline
+ /// now always coalesces batches before sorting. Use
+ /// `sort_coalesce_target_rows` instead.
+ pub sort_in_place_threshold_bytes: usize, warn = "`sort_in_place_threshold_bytes` is deprecated and ignored. Use `sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+ /// Target number of rows to coalesce before sorting in ExternalSorter.
+ ///
+ /// Larger values reduce merge fan-in by producing fewer, larger
+ /// sorted runs.
+ pub sort_coalesce_target_rows: usize, default = 32768
Review Comment:
Yeah, I suspect we'd want a proper sensitivity analysis of
`lexsort_to_indices` (and eventually the radix sort kernel) across different
data types and batch sizes. If our coalesced batches get too large, we might
hit diminishing returns or lose cache friendliness.
This design also spills from the sorted runs first, so holding more unsorted
rows in the coalescer may make it more likely that we trigger spilling.
I'm definitely of the mind that we can and should tune this, but it's unclear
what a reasonable default would even be right now. In Comet, where we run
TPC-H SF 1000, for example, I suspect we'll want longer sorted runs.
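To make the trade-off above concrete, here is a minimal standard-library sketch. It is not DataFusion code. `sorted_runs` and `buffered_bytes` are hypothetical helpers, `sort_unstable` on `u64` keys stands in for `lexsort_to_indices`, and the 16-byte fixed row width is an assumption chosen only for illustration. It shows how the number of sorted runs (the merge fan-in) and the amount of unsorted data held before sorting both move with the coalesce target.

```rust
use std::time::Instant;

/// Hypothetical helper: split `rows` into runs of at most `target_rows`
/// rows and sort each run independently, mimicking "coalesce, then sort".
fn sorted_runs(rows: &[u64], target_rows: usize) -> Vec<Vec<u64>> {
    assert!(target_rows > 0, "target_rows must be non-zero");
    rows.chunks(target_rows)
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Rough estimate of the unsorted bytes buffered before one run is sorted,
/// assuming fixed-width rows (an assumption; real batches vary by schema).
fn buffered_bytes(target_rows: usize, row_width_bytes: usize) -> usize {
    target_rows * row_width_bytes
}

fn main() {
    // Deterministic pseudo-random input from a simple LCG, so the sketch
    // needs no external crates.
    let total_rows = 1_000_000usize;
    let mut state: u64 = 42;
    let rows: Vec<u64> = (0..total_rows)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            state >> 33
        })
        .collect();

    // Sweep a few candidate targets; 32768 is the default proposed in the diff.
    for target_rows in [8192usize, 32768, 131072] {
        let start = Instant::now();
        let runs = sorted_runs(&rows, target_rows);
        let elapsed = start.elapsed();
        println!(
            "target_rows={:>6} runs={:>3} buffered_bytes={:>8} sort_time={:?}",
            target_rows,
            runs.len(),
            buffered_bytes(target_rows, 16),
            elapsed
        );
    }
}
```

With 1,000,000 input rows, the three targets produce 123, 31, and 8 runs respectively, while the buffered estimate grows linearly with the target. The timings are machine-dependent and say nothing about real Arrow kernels; a real analysis would need to run `lexsort_to_indices` itself over representative schemas and memory limits.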
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]