asolimando opened a new pull request, #21815:
URL: https://github.com/apache/datafusion/pull/21815

   ## Which issue does this PR close?
   
   Closes #20184
   
   ## Rationale for this change
   
   `ExecutionPlan::partition_statistics` forces each operator to re-fetch child 
statistics internally, causing exponential recomputation in deep plans and 
making it impossible to inject enriched statistics from external sources (e.g., 
expression-level analyzers, custom statistics providers).
   
   ## What changes are included in this PR?
   
   Breaking change: the `ExecutionPlan::partition_statistics` signature changes 
from `(&self, partition: Option<usize>)` to `(&self, partition: Option<usize>, 
ctx: &StatisticsContext)`. Migration guide added to 
`docs/source/library-user-guide/upgrading/54.0.0.md`.
   
   Add a `StatisticsContext` parameter to `partition_statistics` that carries 
pre-computed child statistics, and a `compute_statistics()` utility that walks 
the plan tree bottom-up, threading child statistics through the context 
automatically. `StatisticsContext` carries one `Arc<Statistics>` per child node 
and is designed to be extended with additional context (e.g., expression-level 
analyzers, custom statistics providers) without further signature changes.
   
   ### Operator categories
   
   - Leaf nodes (EmptyExec, PlaceholderRowExec, WorkTableExec, DataSourceExec): 
ignore the context, return their own stats. DataSourceExec delegates to the 
`DataSource` trait which has a separate `partition_statistics` that was not 
changed.
   - Passthrough operators (BufferExec, CooperativeExec, PartialSortExec, 
OutputRequirementExec): return `ctx.child_stats()[0]` directly.
   - Transform operators (FilterExec, ProjectionExec, AggregateExec, 
WindowAggExec, CoalesceBatchesExec, GlobalLimitExec, LocalLimitExec): use 
`ctx.child_stats()[0]` as input, then apply their transformation (selectivity, 
column projection, grouping cardinality, fetch limit, etc.).
   - Partition-merging operators (CoalescePartitionsExec, 
SortPreservingMergeExec, SortExec with `!preserve_partitioning`, 
RepartitionExec): always need overall child stats regardless of which output 
partition is requested, since they merge/redistribute input partitions. These 
call `compute_statistics(child, None)` internally instead of using the context.
   - Symmetric joins (SortMergeJoinExec): both sides use the same partition, so 
`ctx.child_stats()` is correct for both `None` and `Some(i)` cases.
   - Asymmetric joins (HashJoinExec CollectLeft, CrossJoinExec, 
NestedLoopJoinExec): the left (broadcast) side always needs overall stats, so 
they call `compute_statistics(left, None)` for the `Some` case. The right side 
is partitioned and uses `ctx.child_stats()[1]` directly. HashJoinExec 
Partitioned mode is symmetric (both use context). HashJoinExec Auto mode needs 
overall stats from both sides.
   - Union/Interleave: UnionExec uses `ctx.child_stats()` for the `None` case 
(reduces with `stats_union`). For `Some(partition)`, Union remaps partition 
indices across children and calls `compute_statistics` on the specific child 
with the remapped index. InterleaveExec uses `ctx.child_stats()` directly 
(symmetric across all inputs).
   
   ### Callers
   
   All direct `plan.partition_statistics(None)` calls in optimizer rules 
(JoinSelection, AggregateStatistics, EnforceDistribution), display code, 
StatisticsRegistry, and tests are replaced with `compute_statistics(plan, 
None)`.
   
   ## Tests
   
   No new tests added. This is a no-op refactoring confirmed by all existing 
tests passing unchanged across all affected crates (datafusion-physical-plan, 
datafusion-physical-optimizer, datafusion, datafusion-datasource).
   
   ## What remains for follow-up
   
   - Adding expression-level analyzers and custom statistics providers to 
`StatisticsContext` (eliminates the separate `StatisticsRegistry` tree walk and 
the `ExpressionAnalyzer` injection machinery from #21122)
   - Extending `DataSource::partition_statistics` with context if needed
   - Caching to avoid redundant `compute_statistics(child, None)` calls: 
partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and 
asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently 
call `compute_statistics(child, None)` internally when the requested partition 
is `Some`, triggering a separate bottom-up walk. A cache on `StatisticsContext` 
keyed by (plan node, partition) would let these reuse already-computed results.
   
   ## Test plan
   
   - [x] `cargo fmt --all`
   - [x] `cargo clippy --all-targets --all-features -- -D warnings` (affected 
crates)
   - [x] `cargo test --profile ci` on datafusion-physical-plan, 
datafusion-physical-optimizer, datafusion, datafusion-datasource
   
   ----
   
   Disclaimer: I used AI to assist in the code generation, I have manually 
reviewed the output and it matches my intention and understanding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to