mbutrovich commented on PR #21525: URL: https://github.com/apache/datafusion/pull/21525#issuecomment-4218416280
Benchmark results show no improvement, which led me to look more closely at the sorting state machine. `ExternalSorter` only concatenates batches into a single large sort when the total buffered data is under 1 MB (`sort_in_place_threshold_bytes`). Above that, it sorts each small batch (1024–8192 rows) individually and merges them. Radix sort needs a large batch to amortize the `RowConverter` encoding overhead; at small batch sizes the encoding cost dominates any radix advantage. We can't simply raise the threshold because `concat_batches` temporarily doubles memory, and `RowConverter` adds another large allocation on top. There's also no arrow-rs API to incrementally append to a `Rows` buffer across batches, so we can't amortize encoding as data arrives. Shelving the DataFusion integration and focusing on landing the arrow-rs kernel for now.
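For readers unfamiliar with the two paths described above, here is a rough, std-only sketch of the decision. It is not the `ExternalSorter` code: a "batch" is modeled as a plain `Vec<u64>`, and `SortPath`, `sort_buffered`, and `merge_sorted` are hypothetical names made up for illustration. Only the 1 MB threshold value is taken from the comment.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for `sort_in_place_threshold_bytes` (1 MB, per the comment above).
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024;

/// Which path the toy sorter took. Hypothetical names, for illustration only.
#[derive(Debug, PartialEq)]
enum SortPath {
    ConcatThenSort,
    SortEachThenMerge,
}

/// Toy model of the decision: each "batch" is just a Vec<u64>.
fn sort_buffered(batches: Vec<Vec<u64>>, threshold_bytes: usize) -> (SortPath, Vec<u64>) {
    let buffered_bytes: usize = batches
        .iter()
        .map(|b| b.len() * std::mem::size_of::<u64>())
        .sum();

    if buffered_bytes < threshold_bytes {
        // Small input: copy everything into one buffer and sort once.
        // The copy is why concatenation briefly holds input and output together.
        let mut all: Vec<u64> = batches.into_iter().flatten().collect();
        all.sort_unstable();
        (SortPath::ConcatThenSort, all)
    } else {
        // Large input: sort every small batch on its own, then merge the runs.
        // Any per-sort setup cost is paid once per batch on this path.
        let runs: Vec<Vec<u64>> = batches
            .into_iter()
            .map(|mut b| {
                b.sort_unstable();
                b
            })
            .collect();
        (SortPath::SortEachThenMerge, merge_sorted(&runs))
    }
}

/// K-way merge of already-sorted runs using a min-heap.
fn merge_sorted(runs: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

In this model, the second branch is where a kernel with a high fixed setup cost (such as row encoding before a radix sort) would be invoked once per small batch rather than once overall, which is the amortization problem described above.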
