mbutrovich commented on PR #21525:
URL: https://github.com/apache/datafusion/pull/21525#issuecomment-4218416280

   Benchmark results show no improvement, which caused me to look at the 
sorting state machine a bit more.
   
   `ExternalSorter` only concatenates batches into a single large sort when total 
buffered data is < 1MB (`sort_in_place_threshold_bytes`). Above that, it sorts 
each small batch (1024–8192 rows) individually and merges them. Radix sort 
needs a large batch to amortize `RowConverter` encoding overhead; at small 
batch sizes the encoding cost dominates any radix advantage.
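
   To make the two paths concrete, here is a rough standalone sketch of the decision described above. It is not DataFusion's code: `Batch`, `Strategy`, `choose_strategy`, `sort_buffered`, and `merge_runs` are hypothetical names, a batch is simplified to a single `Vec<i64>` key column, and only the 1MB threshold comes from the comment.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a record batch: one column of i64 sort keys.
type Batch = Vec<i64>;

/// Which path the sketch took. Illustrative names, not DataFusion's.
#[derive(Debug, PartialEq)]
enum Strategy {
    ConcatAndSortOnce,
    SortEachThenMerge,
}

/// 1MB, mirroring the `sort_in_place_threshold_bytes` value quoted above.
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024;

/// Bytes held by the buffered key columns (a simplification of batch size).
fn buffered_bytes(batches: &[Batch]) -> usize {
    batches
        .iter()
        .map(|b| b.len() * std::mem::size_of::<i64>())
        .sum()
}

/// Below the threshold: one big sort. At or above it: per-batch sort + merge.
fn choose_strategy(batches: &[Batch], threshold: usize) -> Strategy {
    if buffered_bytes(batches) < threshold {
        Strategy::ConcatAndSortOnce
    } else {
        Strategy::SortEachThenMerge
    }
}

/// K-way merge of already-sorted runs using a min-heap of (value, run, index).
fn merge_runs(runs: Vec<Batch>) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run, values) in runs.iter().enumerate() {
        if let Some(&v) = values.first() {
            heap.push(Reverse((v, run, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, run, idx))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(idx + 1) {
            heap.push(Reverse((next, run, idx + 1)));
        }
    }
    out
}

/// Sorts the buffered batches and reports which path was taken.
fn sort_buffered(batches: Vec<Batch>, threshold: usize) -> (Strategy, Vec<i64>) {
    let strategy = choose_strategy(&batches, threshold);
    let sorted = match strategy {
        Strategy::ConcatAndSortOnce => {
            // The only path where a large-batch kernel could amortize encoding.
            let mut all: Vec<i64> = batches.into_iter().flatten().collect();
            all.sort_unstable();
            all
        }
        Strategy::SortEachThenMerge => {
            // Each sort sees only one small batch, so fixed per-sort costs repeat.
            let runs: Vec<Batch> = batches
                .into_iter()
                .map(|mut b| {
                    b.sort_unstable();
                    b
                })
                .collect();
            merge_runs(runs)
        }
    };
    (strategy, sorted)
}
```

   Calling `sort_buffered(batches, SORT_IN_PLACE_THRESHOLD_BYTES)` with more than 1MB of keys takes the second branch, which is where any per-sort setup cost is paid once per small batch rather than once overall.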
   
   We can't simply raise the threshold because `concat_batches` temporarily 
doubles memory usage, and `RowConverter` adds another large allocation on top. 
There's also no arrow-rs API to incrementally append to a `Rows` buffer across 
batches, so we can't amortize encoding as data arrives.
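
   As a back-of-the-envelope illustration of that memory argument, the sketch below adds up the three allocations that would be live at once on the concat path. The function names are hypothetical, and the size of the row encoding is an assumed input, since it depends on the schema and is not given here.

```rust
/// Rough peak-memory estimate for the concat-then-sort path (illustrative only).
///
/// `buffered` is the bytes already held in small batches; `row_encoded` is an
/// ASSUMED size for the row-format encoding of the concatenated batch.
fn peak_bytes_concat_sort(buffered: usize, row_encoded: usize) -> usize {
    // Original batches + the concatenated copy + the row encoding on top.
    buffered + buffered + row_encoded
}

/// Whether that estimated peak fits within a given memory budget.
fn fits_in_budget(buffered: usize, row_encoded: usize, budget: usize) -> bool {
    peak_bytes_concat_sort(buffered, row_encoded) <= budget
}
```

   For example, if the row encoding were about as large as the input, 64MB of buffered data would need roughly 192MB at peak, which is why raising the threshold is not free.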
   
   Shelving the DataFusion integration and focusing on landing the arrow-rs 
kernel for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

