zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4181485812
Strange: I tested locally (release build, --partitions 12 and --partitions 16) and found:

1. **Plans are identical** between main and PR for all 4 queries (SPM → DataSourceExec, no SortExec in either case)
2. **Performance is identical** after multiple warm iterations:

```
--partitions 12, release, 5 iterations:
Q1: main 112ms vs PR 108ms (~same)
Q2: main 2.6ms vs PR 2.4ms (~same)
Q3: main 300ms vs PR 299ms (~same)
Q4: main 6.3ms vs PR 6.0ms (~same)
```

The GKE benchmark runs main and PR on different instances (different machine names in bot output), which could explain the consistent per-run variance.

Our code doesn't trigger in this scenario because `EnforceSorting` already eliminates `SortExec` after byte-range splitting creates single-file groups. The optimization triggers when a partition has **multiple files in the wrong order** (e.g., `--partitions 1` or `split_file_groups_by_statistics=true`).
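
To make the trigger condition concrete, here is a minimal Python sketch of what "multiple files in the wrong order" could mean for one partition. It is an illustration only, not the PR's implementation: `FileStats`, `is_in_order`, and `has_files_in_wrong_order` are hypothetical names, and the sketch assumes each file is internally sorted and carries min/max statistics on the sort key.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics on the sort key (not a DataFusion type)."""
    name: str
    min_key: int
    max_key: int


def is_in_order(group: List[FileStats]) -> bool:
    """True if reading the files in listed order yields ascending keys.

    Assumes every file is already sorted internally, so only the
    boundaries between consecutive files need to be checked.
    """
    return all(prev.max_key <= nxt.min_key for prev, nxt in zip(group, group[1:]))


def has_files_in_wrong_order(group: List[FileStats]) -> bool:
    """True only for a group with more than one file whose listed order
    does not match the sort order. A single-file group can never qualify."""
    return len(group) > 1 and not is_in_order(group)


a = FileStats("a.parquet", 0, 9)
b = FileStats("b.parquet", 10, 19)

single_file_group = [a]      # e.g. what byte-range splitting produces
ordered_group = [a, b]       # several files, already in key order
wrong_order_group = [b, a]   # several files, listed out of key order

for label, group in [
    ("single file", single_file_group),
    ("ordered", ordered_group),
    ("wrong order", wrong_order_group),
]:
    print(label, has_files_in_wrong_order(group))
```

Under these assumptions only the last group qualifies, which matches the observation that runs with many partitions (single-file groups) never reach the new code path.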
