asolimando opened a new issue, #21443: URL: https://github.com/apache/datafusion/issues/21443
### Is your feature request related to a problem or challenge?

DataFusion currently computes operator statistics via each operator's built-in `partition_statistics`. While this provides basic cross-operator propagation, several gaps limit the effectiveness of cost-based optimizations:

- No way to override statistics for existing operators: if the built-in estimation for, say, a `HashJoinExec` or `AggregateExec` is inaccurate for your workload, there is no way to plug in a better model without modifying DataFusion's source code
- No support for richer statistical representations: the framework only carries basic statistics (row count, NDV, min/max); there is no mechanism to propagate richer metadata like histograms or sketches through the plan tree
- No external stats injection: there is no way to feed richer statistics from external catalogs (Hive Metastore, Iceberg stats, custom catalogs) into the physical plan's statistics propagation

Without an extension point, advancing CBO in DataFusion becomes increasingly difficult: more sophisticated estimation strategies are hard to land in core because they add complexity that not all users need, and downstream projects embedding DataFusion have no way to bring their own estimation without forking.

See [this discussion](https://github.com/apache/datafusion/pull/20926#issuecomment-4183702274) for context on how pluggable statistics would help DataFusion evolve its CBO beyond any single reference system while keeping the framework extensible.

Related: #8227 (statistics improvements epic), #20184 (`child_stats` in `partition_statistics`)

### Describe the solution you'd like

A pluggable chain-of-responsibility framework for operator-level statistics, following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (#21120) for expression-level statistics:

- `StatisticsProvider` trait: chain element that computes statistics for a specific `ExecutionPlan` operator, returning either computed stats or delegating to the next provider
- `StatisticsRegistry`: chains providers in priority order (first computed result wins), lives in `SessionState`
- `StatsCache`: caller-owned per-pass memoization cache, created by the optimizer rule before a pass and dropped after
- `ExtendedStatistics`: standard `Statistics` plus a type-erased extension map for custom metadata (histograms, sketches, etc.)

The registry walks the plan tree bottom-up: for each node, it first recursively computes enhanced child stats, then passes them to the provider chain. Each provider can use the enhanced child stats to produce a better estimate than the built-in `partition_statistics`, or delegate. Results are cached per-node in the `StatsCache` to avoid redundant walks within the same pass.

The framework should:

- Ship with built-in providers for all standard physical operators (Filter, Projection, Aggregate, Join, Limit, Union, etc.)
- Allow users to register custom providers via `SessionState` for overriding built-in estimation
- Provide at least one optimizer rule integration as an example (e.g., `JoinSelection`)
- Compose with the expression-level `ExpressionAnalyzer` (#21120): the registry feeds enhanced child stats into operators, which internally use the expression analyzer for expression-level estimation
- Be purely additive, gated by a config flag

The design avoids breaking changes, but some upstream changes would simplify the architecture if the community is open to them:

- If `partition_statistics` accepted `child_stats` (#20184), the registry could feed enhanced stats into the built-in path directly, eliminating the separate bottom-up tree walk and ensuring column-level statistics (NDV, min/max) propagate through all operators
- If `Statistics` carried a type-erased extension map (similar to `ExtendedStatistics`), extensions (histograms, sketches) would flow naturally through `partition_statistics` and the separate registry walk for extension propagation could be dropped entirely

### Describe alternatives you've considered

Storing enhanced statistics directly on plan nodes via `PlanProperties`. This would avoid the external walk but requires modifying a core type that every `ExecutionPlan` implementation depends on, making the change significantly more invasive. The external registry is purely additive and follows standard practice in query optimizers (Calcite, Trino, Spark all compute stats on demand after plan transformations).

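
To make the proposed flow easier to picture, here is a minimal, self-contained Rust sketch of the provider chain, the bottom-up walk, and the per-pass cache. It is an illustration under stated assumptions, not the proposed DataFusion API: `PlanNode`, `StatisticsResult`, `FixedSelectivityFilter`, and every method signature are hypothetical stand-ins, and the real design would operate on `ExecutionPlan` and `Statistics`. Only the names `StatisticsProvider`, `StatisticsRegistry`, `StatsCache`, and `ExtendedStatistics` come from the proposal.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a physical plan node (not DataFusion's `ExecutionPlan`).
pub struct PlanNode {
    pub id: usize,         // unique per node; used as the cache key
    pub op: &'static str,  // operator name, e.g. "Scan" or "Filter"
    pub rows: Option<u64>, // built-in row estimate; `None` means unknown
    pub children: Vec<PlanNode>,
}

/// Basic statistics plus a type-erased extension map (histograms, sketches, ...).
#[derive(Default)]
pub struct ExtendedStatistics {
    pub num_rows: Option<u64>,
    pub extensions: HashMap<TypeId, Box<dyn Any + Send + Sync>>,
}

impl ExtendedStatistics {
    /// Looks up an extension by its concrete type.
    pub fn extension<T: Any>(&self) -> Option<&T> {
        self.extensions.get(&TypeId::of::<T>())?.downcast_ref::<T>()
    }
}

/// A provider either computes statistics or delegates to the next provider.
pub enum StatisticsResult {
    Computed(ExtendedStatistics),
    Delegate,
}

pub trait StatisticsProvider {
    fn compute(&self, node: &PlanNode, child_stats: &[Arc<ExtendedStatistics>]) -> StatisticsResult;
}

/// Caller-owned memoization cache: create before a pass, drop after it.
#[derive(Default)]
pub struct StatsCache(HashMap<usize, Arc<ExtendedStatistics>>);

/// Providers in priority order; the first computed result wins.
#[derive(Default)]
pub struct StatisticsRegistry {
    providers: Vec<Box<dyn StatisticsProvider>>,
}

impl StatisticsRegistry {
    pub fn register(&mut self, provider: Box<dyn StatisticsProvider>) {
        self.providers.push(provider);
    }

    /// Bottom-up walk: children first, then the provider chain for this node.
    pub fn compute(&self, node: &PlanNode, cache: &mut StatsCache) -> Arc<ExtendedStatistics> {
        if let Some(hit) = cache.0.get(&node.id) {
            return Arc::clone(hit);
        }
        let child_stats: Vec<_> = node.children.iter().map(|c| self.compute(c, cache)).collect();
        let stats = self
            .providers
            .iter()
            .find_map(|p| match p.compute(node, &child_stats) {
                StatisticsResult::Computed(s) => Some(s),
                StatisticsResult::Delegate => None,
            })
            // Fallback when every provider delegates: the node's built-in estimate.
            .unwrap_or_else(|| ExtendedStatistics { num_rows: node.rows, ..Default::default() });
        let stats = Arc::new(stats);
        cache.0.insert(node.id, Arc::clone(&stats));
        stats
    }
}

/// Example override (hypothetical): assume every Filter keeps 10% of its input.
pub struct FixedSelectivityFilter;

impl StatisticsProvider for FixedSelectivityFilter {
    fn compute(&self, node: &PlanNode, child_stats: &[Arc<ExtendedStatistics>]) -> StatisticsResult {
        match (node.op, child_stats.first().and_then(|s| s.num_rows)) {
            ("Filter", Some(rows)) => StatisticsResult::Computed(ExtendedStatistics {
                num_rows: Some(rows / 10),
                ..Default::default()
            }),
            _ => StatisticsResult::Delegate,
        }
    }
}
```

With `FixedSelectivityFilter` registered, a `Filter` node over a `Scan` node that reports 1000 rows resolves to 100 rows, while the `Scan` itself falls through the chain to its built-in estimate. Keeping the cache outside the registry mirrors the "caller-owned, per-pass" wording above: stale entries cannot outlive the optimizer pass that created them.
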
### Planned work

Framework

- [ ] StatisticsProvider trait, StatisticsRegistry, StatsCache, ExtendedStatistics
- [ ] DefaultStatisticsProvider (fallback to partition_statistics)
- [ ] SessionState integration, config flag

Built-in providers

- [ ] FilterStatisticsProvider (selectivity + post-filter NDV adjustment)
- [ ] ProjectionStatisticsProvider (column mapping through projections)
- [ ] PassthroughStatisticsProvider (schema-preserving, cardinality-preserving operators: Sort, Repartition, CoalesceBatches, etc.)
- [ ] AggregateStatisticsProvider (NDV-product estimation for GROUP BY)
- [ ] JoinStatisticsProvider (NDV-based join estimation, multi-key, join-type-aware bounds, hash/sort-merge/nested-loop/cross)
- [ ] LimitStatisticsProvider (local + global, skip + fetch)
- [ ] UnionStatisticsProvider (row count summation with `Absent` propagation)

Optimizer integration

- [ ] Provide at least one optimizer rule integration as an example (e.g., `JoinSelection`)

Other physical optimizer decisions that could benefit from enhanced statistics include aggregate mode selection (single vs two-phase based on estimated group count), dynamic join algorithm selection (hash vs sort-merge), and dynamic filter activation based on estimated selectivity, among others.

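
For readers unfamiliar with the estimation rules named in the provider list, the following Rust sketch shows the usual textbook form of four of them. These are assumptions for illustration, not the formulas the planned providers will necessarily use: the function names are hypothetical, `Option<u64>` stands in for DataFusion's precision-aware statistics (with `None` playing the role of `Absent`), and the join case covers only a single-key inner equi-join.

```rust
/// Row-count estimate that may be unknown; `None` plays the role of `Absent`.
pub type Rows = Option<u64>;

/// Limit: rows remaining after `skip`, capped by `fetch` when present.
pub fn limit_rows(input: Rows, skip: u64, fetch: Option<u64>) -> Rows {
    match (input, fetch) {
        (Some(n), Some(f)) => Some(n.saturating_sub(skip).min(f)),
        (Some(n), None) => Some(n.saturating_sub(skip)),
        // Unknown input: `fetch` is only an upper bound, not an estimate.
        (None, Some(f)) => Some(f),
        (None, None) => None,
    }
}

/// Union: sum of the input row counts; any unknown input makes the result unknown.
pub fn union_rows(inputs: &[Rows]) -> Rows {
    inputs
        .iter()
        .copied()
        .try_fold(0u64, |acc, rows| rows.map(|n| acc.saturating_add(n)))
}

/// GROUP BY: product of the key NDVs, capped by the input row count.
pub fn aggregate_rows(input: Rows, key_ndvs: &[Option<u64>]) -> Rows {
    let product = key_ndvs
        .iter()
        .copied()
        .try_fold(1u64, |acc, ndv| ndv.map(|d| acc.saturating_mul(d)));
    match (product, input) {
        (Some(p), Some(n)) => Some(p.min(n)),
        (Some(p), None) => Some(p),
        // Some key NDV is unknown: fall back to the input row count.
        (None, n) => n,
    }
}

/// Single-key inner equi-join: |L| * |R| / max(ndv_left, ndv_right).
pub fn inner_join_rows(left: Rows, right: Rows, left_ndv: Option<u64>, right_ndv: Option<u64>) -> Rows {
    let (l, r) = (left?, right?);
    // Clamp to 1 so a reported NDV of zero cannot cause a division by zero.
    let ndv = left_ndv?.max(right_ndv?).max(1);
    let estimate = (l as u128 * r as u128) / ndv as u128;
    Some(estimate.min(u64::MAX as u128) as u64)
}
```

For example, under these assumptions a limit with `skip = 10` and `fetch = 50` over 100 input rows yields 50 rows, and joining 1000 rows against 500 rows on a key with NDVs of 100 and 50 yields 1000 * 500 / 100 = 5000 rows.
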
### Additional context

- Related: #21120 (`ExpressionAnalyzer`, expression-level complement), #20184 (`child_stats` in `partition_statistics`, prerequisite for column-level propagation), #20926 (aggregate NDV estimation), #8227 (statistics improvements epic), #15873 (`partition_statistics` operator support tracking)
- `PhysicalOptimizerRule::optimize` currently receives only `ConfigOptions`; rules needing additional context (statistics registry, expression analyzer) require a richer context parameter or a workaround
- [RelMetadataQuery](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataQuery.html) / [RelMetadataProvider](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataProvider.html) pattern, adapted for Rust's ownership model

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
