Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081762076


##########
datafusion/common/src/config.rs:
##########
@@ -555,7 +555,17 @@ config_namespace! {
         /// When sorting, below what size should data be concatenated
         /// and sorted in a single RecordBatch rather than sorted in
         /// batches and merged.
-        pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+        ///
+        /// Deprecated: this option is no longer used. The sort pipeline
+        /// now always coalesces batches before sorting. Use
+        /// `sort_coalesce_target_rows` instead.
+        pub sort_in_place_threshold_bytes: usize, warn = "`sort_in_place_threshold_bytes` is deprecated and ignored. Use `sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+        /// Target number of rows to coalesce before sorting in ExternalSorter.
+        ///
+        /// Larger values reduce merge fan-in by producing fewer, larger
+        /// sorted runs.
+        pub sort_coalesce_target_rows: usize, default = 32768

Review Comment:
   I wonder if we can make this somewhat adaptive. Since we usually load
everything into memory, larger batches seem even more favorable for very large
inputs. For example, if the data is 1GiB, using 10MiB of "scratch space" for
coalescing instead of a fixed 32768 rows would make sense, and might even be
faster?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
