asolimando opened a new issue, #21443:
URL: https://github.com/apache/datafusion/issues/21443

   ### Is your feature request related to a problem or challenge?
   
   DataFusion currently computes operator statistics via each operator's 
built-in `partition_statistics`. While this provides basic cross-operator 
propagation, several gaps limit the effectiveness of cost-based optimizations:
   
   - No way to override statistics for existing operators: if the built-in 
estimation for, say, a `HashJoinExec` or `AggregateExec` is inaccurate for your 
workload, there is no way to plug in a better model without modifying 
DataFusion's source code
   - No support for richer statistical representations: the framework only 
carries basic statistics (row count, NDV, min/max); there is no mechanism to 
propagate richer metadata like histograms or sketches through the plan tree
   - No external stats injection: there is no way to feed richer statistics 
from external catalogs (Hive Metastore, Iceberg stats, custom catalogs) into 
the physical plan's statistics propagation
   
   Without an extension point, advancing cost-based optimization (CBO) in 
DataFusion becomes increasingly difficult: more sophisticated estimation 
strategies are hard to land in core because they add complexity that not all 
users need, and downstream projects embedding DataFusion cannot bring their own 
estimation without forking.
   
   See [this 
discussion](https://github.com/apache/datafusion/pull/20926#issuecomment-4183702274)
 for context on how pluggable statistics would help DataFusion evolve its CBO 
beyond any single reference system while keeping the framework extensible.
   
   Related: #8227 (statistics improvements epic), #20184 (`child_stats` in 
`partition_statistics`)
   
   ### Describe the solution you'd like
   
   A pluggable chain-of-responsibility framework for operator-level statistics, 
following the same pattern as `RelationPlanner` for SQL parsing and 
`ExpressionAnalyzer` (#21120) for expression-level statistics:
   
   - `StatisticsProvider` trait: chain element that computes statistics for a 
specific `ExecutionPlan` operator, returning either computed stats or 
delegating to the next provider
   - `StatisticsRegistry`: chains providers in priority order (first computed 
result wins), lives in `SessionState`
   - `StatsCache`: caller-owned per-pass memoization cache, created by the 
optimizer rule before a pass and dropped after
   - `ExtendedStatistics`: standard `Statistics` plus a type-erased extension 
map for custom metadata (histograms, sketches, etc.)
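
A minimal sketch of how the provider chain described above could fit together. Everything here is an assumption for illustration: `Stats` and `PlanNode` are simplified stand-ins for DataFusion's `Statistics` and `ExecutionPlan`, and the `StatisticsResult` enum, method names, and the two example providers are hypothetical, not the proposed API.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `Statistics` (row count only).
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub num_rows: Option<usize>,
}

/// Hypothetical stand-in for an `ExecutionPlan` node.
pub struct PlanNode {
    pub name: &'static str,
}

/// What a provider returns: a result, or a request to try the next provider.
pub enum StatisticsResult {
    Computed(Stats),
    Delegate,
}

/// One element of the chain.
pub trait StatisticsProvider: Send + Sync {
    fn compute_statistics(&self, node: &PlanNode, child_stats: &[Stats]) -> StatisticsResult;
}

/// Holds providers in priority order; the first `Computed` result wins.
#[derive(Default)]
pub struct StatisticsRegistry {
    providers: Vec<Arc<dyn StatisticsProvider>>,
}

impl StatisticsRegistry {
    pub fn register(&mut self, provider: Arc<dyn StatisticsProvider>) {
        self.providers.push(provider);
    }

    /// Returns `None` when every provider delegates.
    pub fn compute(&self, node: &PlanNode, child_stats: &[Stats]) -> Option<Stats> {
        self.providers
            .iter()
            .find_map(|p| match p.compute_statistics(node, child_stats) {
                StatisticsResult::Computed(stats) => Some(stats),
                StatisticsResult::Delegate => None,
            })
    }
}

/// Illustrative custom provider: assumes a fixed 50% selectivity for filters.
pub struct HalvingFilterProvider;

impl StatisticsProvider for HalvingFilterProvider {
    fn compute_statistics(&self, node: &PlanNode, child_stats: &[Stats]) -> StatisticsResult {
        if node.name != "Filter" {
            return StatisticsResult::Delegate;
        }
        match child_stats.first().and_then(|s| s.num_rows) {
            Some(rows) => StatisticsResult::Computed(Stats { num_rows: Some(rows / 2) }),
            None => StatisticsResult::Delegate,
        }
    }
}

/// Illustrative fallback: forwards the first child's statistics unchanged.
pub struct PassthroughProvider;

impl StatisticsProvider for PassthroughProvider {
    fn compute_statistics(&self, _node: &PlanNode, child_stats: &[Stats]) -> StatisticsResult {
        match child_stats.first() {
            Some(stats) => StatisticsResult::Computed(stats.clone()),
            None => StatisticsResult::Delegate,
        }
    }
}
```

With `HalvingFilterProvider` registered ahead of `PassthroughProvider`, a `Filter` node over a 100-row child resolves to 50 rows, while any other node falls through to the passthrough and keeps 100.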
   
   The registry walks the plan tree bottom-up: for each node, it first 
recursively computes enhanced child stats, then passes them to the provider 
chain. Each provider can use the enhanced child stats to produce a better 
estimate than the built-in `partition_statistics`, or delegate. Results are 
cached per-node in the `StatsCache` to avoid redundant walks within the same 
pass.
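
The bottom-up walk with per-pass memoization could be sketched roughly as follows. This is a toy model under stated assumptions: the `PlanNode`, `Op`, and `Stats` types are hypothetical, nodes carry an explicit `id` as the cache key (a real implementation would need some other notion of node identity), and `estimate` stands in for the provider chain.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `Statistics` (row count only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub num_rows: Option<usize>,
}

/// Hypothetical operators for the toy plan.
pub enum Op {
    Scan { rows: usize },
    Limit { fetch: usize },
    Union,
}

pub struct PlanNode {
    pub id: usize, // assumed unique per node; used as the cache key
    pub op: Op,
    pub children: Vec<PlanNode>,
}

/// Caller-owned cache: created before an optimizer pass, dropped after it.
#[derive(Default)]
pub struct StatsCache {
    entries: HashMap<usize, Stats>,
    pub hits: usize,
}

/// Stand-in for the provider chain: estimates a node from its child stats.
fn estimate(node: &PlanNode, child_stats: &[Stats]) -> Stats {
    let num_rows = match &node.op {
        Op::Scan { rows } => Some(*rows),
        Op::Limit { fetch } => child_stats
            .first()
            .and_then(|s| s.num_rows)
            .map(|n| n.min(*fetch)),
        // Summing `Option`s yields `None` as soon as any child is unknown.
        Op::Union => child_stats.iter().map(|s| s.num_rows).sum::<Option<usize>>(),
    };
    Stats { num_rows }
}

/// Children first, then the node itself; each result is cached by node id.
pub fn compute_stats(node: &PlanNode, cache: &mut StatsCache) -> Stats {
    if let Some(stats) = cache.entries.get(&node.id).copied() {
        cache.hits += 1;
        return stats;
    }
    let child_stats: Vec<Stats> = node
        .children
        .iter()
        .map(|child| compute_stats(child, cache))
        .collect();
    let stats = estimate(node, &child_stats);
    cache.entries.insert(node.id, stats);
    stats
}
```

Because the cache is owned by the caller, a second query for any subtree within the same pass is answered from the cache without walking its children again.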
   
   The framework should:
   
   - Ship with built-in providers for all standard physical operators (Filter, 
Projection, Aggregate, Join, Limit, Union, etc.)
   - Allow users to register custom providers via `SessionState` for overriding 
built-in estimation
   - Provide at least one optimizer rule integration as an example (e.g., 
`JoinSelection`)
   - Compose with the expression-level `ExpressionAnalyzer` (#21120): the 
registry feeds enhanced child stats into operators, which internally use the 
expression analyzer for expression-level estimation
   - Be purely additive, gated by a config flag
   
   The design avoids breaking changes, but some upstream changes would simplify 
the architecture if the community is open to them:
   
   - If `partition_statistics` accepted `child_stats` (#20184), the registry 
could feed enhanced stats into the built-in path directly, eliminating the 
separate bottom-up tree walk and ensuring column-level statistics (NDV, 
min/max) propagate through all operators
   - If `Statistics` carried a type-erased extension map (similar to 
`ExtendedStatistics`), extensions (histograms, sketches) would flow naturally 
through `partition_statistics` and the separate registry walk for extension 
propagation could be dropped entirely
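
For illustration, a type-erased extension map of the kind mentioned above can be built on `std::any`. This is only a sketch under assumptions: the field layout, the method names `with_extension` and `extension`, and the `EquiWidthHistogram` type are all hypothetical and not taken from the proposal.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `Statistics` (row count only).
#[derive(Debug, Clone, Default)]
pub struct Stats {
    pub num_rows: Option<usize>,
}

/// Standard statistics plus at most one extension value per Rust type.
#[derive(Clone, Default)]
pub struct ExtendedStatistics {
    pub base: Stats,
    extensions: HashMap<TypeId, Arc<dyn Any + Send + Sync>>,
}

impl ExtendedStatistics {
    /// Stores `value`, replacing any previous extension of the same type.
    pub fn with_extension<T: Any + Send + Sync>(mut self, value: T) -> Self {
        self.extensions.insert(TypeId::of::<T>(), Arc::new(value));
        self
    }

    /// Looks up an extension by type; `None` if it was never attached.
    pub fn extension<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.extensions
            .get(&TypeId::of::<T>())
            .and_then(|value| value.downcast_ref::<T>())
    }
}

/// Example of custom metadata a downstream project might attach.
#[derive(Debug, PartialEq)]
pub struct EquiWidthHistogram {
    pub min: i64,
    pub max: i64,
    pub bucket_counts: Vec<usize>,
}
```

Keying by `TypeId` keeps the core type unaware of what extensions exist: a provider that knows about `EquiWidthHistogram` can read it back, and providers that do not simply ignore it.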
   
   ### Describe alternatives you've considered
   
   Storing enhanced statistics directly on plan nodes via `PlanProperties`. 
This would avoid the external walk but requires modifying a core type that 
every `ExecutionPlan` implementation depends on, making the change 
significantly more invasive. The external registry is purely additive and 
follows standard practice in query optimizers (Calcite, Trino, Spark all 
compute stats on demand after plan transformations).
   
   ### Planned work
   
   Framework
   - [ ] StatisticsProvider trait, StatisticsRegistry, StatsCache, 
ExtendedStatistics
   - [ ] DefaultStatisticsProvider (fallback to partition_statistics)
   - [ ] SessionState integration, config flag
   
   Built-in providers
   - [ ] FilterStatisticsProvider (selectivity + post-filter NDV adjustment)
   - [ ] ProjectionStatisticsProvider (column mapping through projections)
   - [ ] PassthroughStatisticsProvider (schema-preserving, 
cardinality-preserving operators: Sort, Repartition, CoalesceBatches, etc.)
   - [ ] AggregateStatisticsProvider (NDV-product estimation for GROUP BY)
   - [ ] JoinStatisticsProvider (NDV-based join estimation, multi-key, 
join-type-aware bounds, hash/sort-merge/nested-loop/cross)
   - [ ] LimitStatisticsProvider (local + global, skip + fetch)
   - [ ] UnionStatisticsProvider (row count summation with `Absent` propagation)
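
To make two of the estimators above concrete, here is a sketch using common textbook formulas. The issue does not spell out the exact formulas, so the function names, signatures, and the handling of unknown NDVs are assumptions for illustration only.

```rust
/// NDV-product estimate for GROUP BY: the number of groups is at most the
/// product of the key columns' distinct counts, and at most the input row count.
/// Assumption: if any key NDV is unknown, fall back to the input row count
/// as an upper bound.
pub fn estimate_group_count(
    input_rows: Option<usize>,
    key_ndvs: &[Option<usize>],
) -> Option<usize> {
    // `try_fold` stops and yields `None` at the first unknown NDV.
    let product = key_ndvs
        .iter()
        .try_fold(1usize, |acc, ndv| ndv.map(|n| acc.saturating_mul(n)));
    match (product, input_rows) {
        (Some(p), Some(rows)) => Some(p.min(rows)),
        (Some(p), None) => Some(p),
        (None, rows) => rows,
    }
}

/// NDV-based inner equi-join estimate on a single key:
/// |L| * |R| / max(ndv(L.key), ndv(R.key)).
pub fn estimate_inner_join_rows(
    left_rows: usize,
    right_rows: usize,
    left_ndv: usize,
    right_ndv: usize,
) -> usize {
    // Guard against division by zero when both NDVs are reported as 0.
    let denominator = left_ndv.max(right_ndv).max(1);
    left_rows.saturating_mul(right_rows) / denominator
}
```

For example, grouping 1000 rows by two keys with 10 and 20 distinct values gives an estimate of 200 groups, while the same keys over only 100 rows are capped at 100.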
   
   Optimizer integration
   - [ ] Integrate with at least one optimizer rule as an example (e.g., 
`JoinSelection`)
   
   Other physical optimizer decisions that could benefit from enhanced 
statistics include aggregate mode selection (single vs two-phase based on 
estimated group count), dynamic join algorithm selection (hash vs sort-merge), 
and dynamic filter activation based on estimated selectivity, among others.
   
   ### Additional context
   
   - Related: #21120 (`ExpressionAnalyzer`, expression-level complement), 
#20184 (`child_stats` in `partition_statistics`, prerequisite for column-level 
propagation), #20926 (aggregate NDV estimation), #8227 (statistics improvements 
epic), #15873 (`partition_statistics` operator support tracking)
   - `PhysicalOptimizerRule::optimize` currently receives only `ConfigOptions`; 
rules needing additional context (statistics registry, expression analyzer) 
require a richer context parameter or a workaround
   - Prior art: Apache Calcite's 
[RelMetadataQuery](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataQuery.html) / 
[RelMetadataProvider](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataProvider.html) 
pattern, adapted for Rust's ownership model
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

