2010YOUY01 commented on issue #21381:
URL: https://github.com/apache/datafusion/issues/21381#issuecomment-4190351686

   > When `SortExec` is eliminated via sort pushdown (statistics-based file 
reordering), `SortPreservingMergeExec` reads directly from I/O-bound sources 
instead of from `SortExec`'s in-memory buffer. Currently SPM does a single 
K-way merge of all input streams, which can become a bottleneck when there are 
many partitions.
   
   The current implementation already forms a tree of SPMs similar to what you 
have described, so there shouldn’t be a major bottleneck I think, though there 
may still be room for improvement.
   
   Currently, there are two stages of SPM:
   - Stage 1: `SortExec` output, where the merge degree equals the number of 
in-memory batches
   - Stage 2: Final `SortPreservingMergeExec`, where the merge degree equals 
the number of partitions
   
   Stage 1 is non-blocking, so the two SPM stages can run in parallel. The main 
optimization opportunity seems to be balancing the merge degrees across the two 
stages to avoid one becoming a bottleneck. For example, if `SortExec` performs 
a 4-way merge while Stage 2 performs a 32-way merge, Stage 2 is likely to be 
the bottleneck. This could potentially be improved with a more recursive 
merging scheme.
   
   > Related: DuckDB implements a similar [parallel merge 
sort](https://duckdb.org/2021/08/27/external-sorting.html) strategy.
   
   They have a more recent version with a parallel k-way merge algorithm: 
https://duckdb.org/2025/09/24/sorting-again#k-way-merge-path


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to