zhuqi-lucas commented on code in PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#discussion_r3035156205
##########
datafusion/physical-plan/src/sorts/sort_preserving_merge.rs:
##########
@@ -366,7 +366,7 @@ impl ExecutionPlan for SortPreservingMergeExec {
                 .map(|partition| {
                     let stream =
                         self.input.execute(partition, Arc::clone(&context))?;
-                    Ok(spawn_buffered(stream, 1))
+                    Ok(spawn_buffered(stream, 16))
Review Comment:
Thanks @adriangb for the suggestion! Just to confirm I understand
correctly — the big win for the Inexact path would be: statistics-based file
reordering + TopK + dynamic filter pushdown, where TopK reads the first
file, sets a tight threshold, and then skips subsequent files entirely via
row group pruning?
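To make that question concrete, here is a minimal std-only sketch of the idea as I understand it. It is not DataFusion code: `FileStats` and `top_k_with_pruning` are hypothetical names, plain `i64` values stand in for the sort key, and whole files are skipped based on a `min` statistic, whereas the real path would prune row groups through the dynamic filter.

```rust
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a file that carries a min statistic on the sort key.
struct FileStats {
    min: i64,
    rows: Vec<i64>,
}

/// Returns the k smallest values (ascending) and how many files were actually read.
fn top_k_with_pruning(mut files: Vec<FileStats>, k: usize) -> (Vec<i64>, usize) {
    // Statistics-based reordering: visit the file with the smallest min first.
    files.sort_by_key(|f| f.min);

    // Max-heap holding the current k smallest values; its top is the threshold.
    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
    let mut files_read = 0;

    for file in &files {
        // Dynamic filter: once the heap is full, a file whose min is not below
        // the current threshold cannot contribute, so it is skipped unread.
        if heap.len() == k {
            if let Some(&threshold) = heap.peek() {
                if file.min >= threshold {
                    continue;
                }
            }
        }
        files_read += 1;
        for &v in &file.rows {
            if heap.len() < k {
                heap.push(v);
            } else if let Some(&top) = heap.peek() {
                if v < top {
                    // v displaces the current largest of the k smallest values.
                    heap.pop();
                    heap.push(v);
                }
            }
        }
    }
    (heap.into_sorted_vec(), files_read)
}
```

With non-overlapping files, the first file fills the heap and every later file is skipped. With overlapping ranges, later files still have to be read, which is why the statistics-based ordering matters.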
I've also addressed the prefetch concern in the latest push — it's now
scoped to only the sort elimination path (added a prefetch field to
SortPreservingMergeExec, default 1, only set to 16 when PushdownSort eliminates
SortExec under SPM).
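Roughly, the scoping looks like the sketch below. This is a simplified std-only illustration, not the code in the PR: `MergeSketch`, `with_prefetch`, `spawn_inputs`, and `plan_merge` are hypothetical names, and a bounded `sync_channel` stands in for the buffer that `spawn_buffered` puts in front of each input stream.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stand-in for the merge operator; only the buffering knob is modeled.
#[derive(Debug, Clone, PartialEq)]
struct MergeSketch {
    prefetch: usize,
}

impl Default for MergeSketch {
    fn default() -> Self {
        // Default matches the previous behavior: buffer a single batch per input.
        Self { prefetch: 1 }
    }
}

impl MergeSketch {
    fn with_prefetch(mut self, prefetch: usize) -> Self {
        self.prefetch = prefetch;
        self
    }

    /// Spawns one producer per input partition, each behind a bounded channel
    /// that holds at most `prefetch` items before the producer blocks.
    fn spawn_inputs(&self, partitions: Vec<Vec<i32>>) -> Vec<Receiver<i32>> {
        partitions
            .into_iter()
            .map(|batches| {
                let (tx, rx) = sync_channel(self.prefetch);
                thread::spawn(move || {
                    for batch in batches {
                        // Stop producing if the consumer has gone away.
                        if tx.send(batch).is_err() {
                            break;
                        }
                    }
                });
                rx
            })
            .collect()
    }
}

/// Hypothetical optimizer hook: only the sort elimination path raises prefetch.
fn plan_merge(sort_eliminated: bool) -> MergeSketch {
    let merge = MergeSketch::default();
    if sort_eliminated {
        merge.with_prefetch(16)
    } else {
        merge
    }
}
```

Every other plan keeps the default of 1, so the extra buffering memory is only paid where the merge reads directly from the scan.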
If splitting the PR is still preferred, I'm happy to do that. I would first
need to add benchmarks specifically for the Inexact path (TopK + file
reordering) to validate the performance gains, because the gains in the
current benchmark results come only from the exact data match cases. Let me know!