asolimando opened a new pull request, #21815: URL: https://github.com/apache/datafusion/pull/21815
## Which issue does this PR close? Closes #20184 ## Rationale for this change `ExecutionPlan::partition_statistics` forces each operator to re-fetch child statistics internally, causing exponential recomputation in deep plans and making it impossible to inject enriched statistics from external sources (e.g., expression-level analyzers, custom statistics providers). ## What changes are included in this PR? Breaking change: the `ExecutionPlan::partition_statistics` signature changes from `(&self, partition: Option<usize>)` to `(&self, partition: Option<usize>, ctx: &StatisticsContext)`. Migration guide added to `docs/source/library-user-guide/upgrading/54.0.0.md`. Add a `StatisticsContext` parameter to `partition_statistics` that carries pre-computed child statistics, and a `compute_statistics()` utility that walks the plan tree bottom-up, threading child statistics through the context automatically. `StatisticsContext` carries one `Arc<Statistics>` per child node and is designed to be extended with additional context (e.g., expression-level analyzers, custom statistics providers) without further signature changes. ### Operator categories - Leaf nodes (EmptyExec, PlaceholderRowExec, WorkTableExec, DataSourceExec): ignore the context, return their own stats. DataSourceExec delegates to the `DataSource` trait which has a separate `partition_statistics` that was not changed. - Passthrough operators (BufferExec, CooperativeExec, PartialSortExec, OutputRequirementExec): return `ctx.child_stats()[0]` directly. - Transform operators (FilterExec, ProjectionExec, AggregateExec, WindowAggExec, CoalesceBatchesExec, GlobalLimitExec, LocalLimitExec): use `ctx.child_stats()[0]` as input, then apply their transformation (selectivity, column projection, grouping cardinality, fetch limit, etc.). - Partition-merging operators (CoalescePartitionsExec, SortPreservingMergeExec, SortExec with `!preserve_partitioning`, RepartitionExec): always need overall child stats regardless of which output partition is requested, since they merge/redistribute input partitions. These call `compute_statistics(child, None)` internally instead of using the context. - Symmetric joins (SortMergeJoinExec): both sides use the same partition, so `ctx.child_stats()` is correct for both `None` and `Some(i)` cases. - Asymmetric joins (HashJoinExec CollectLeft, CrossJoinExec, NestedLoopJoinExec): the left (broadcast) side always needs overall stats, so they call `compute_statistics(left, None)` for the `Some` case. The right side is partitioned and uses `ctx.child_stats()[1]` directly. HashJoinExec Partitioned mode is symmetric (both use context). HashJoinExec Auto mode needs overall stats from both sides. - Union/Interleave: UnionExec uses `ctx.child_stats()` for the `None` case (reduces with `stats_union`). For `Some(partition)`, Union remaps partition indices across children and calls `compute_statistics` on the specific child with the remapped index. InterleaveExec uses `ctx.child_stats()` directly (symmetric across all inputs). ### Callers All direct `plan.partition_statistics(None)` calls in optimizer rules (JoinSelection, AggregateStatistics, EnforceDistribution), display code, StatisticsRegistry, and tests are replaced with `compute_statistics(plan, None)`. ## Tests No new tests added. This is a no-op refactoring confirmed by all existing tests passing unchanged across all affected crates (datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource). ## What remains for follow-up - Adding expression-level analyzers and custom statistics providers to `StatisticsContext` (eliminates the separate `StatisticsRegistry` tree walk and the `ExpressionAnalyzer` injection machinery from #21122) - Extending `DataSource::partition_statistics` with context if needed - Caching to avoid redundant `compute_statistics(child, None)` calls: partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently call `compute_statistics(child, None)` internally when the requested partition is `Some`, triggering a separate bottom-up walk. A cache on `StatisticsContext` keyed by (plan node, partition) would let these reuse already-computed results. ## Test plan - [x] `cargo fmt --all` - [x] `cargo clippy --all-targets --all-features -- -D warnings` (affected crates) - [x] `cargo test --profile ci` on datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource ---- Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
