asolimando commented on PR #21996: URL: https://github.com/apache/datafusion/pull/21996#issuecomment-4367087797
Really interesting work @adriangb, addressing an important gap, this has potential to unlock a lot of further development on the statistics front. I've been working on several related pieces (ExpressionAnalyzer #21122, StatisticsRegistry #21483, StatisticsContext #21815) and wanted to offer some perspective on how they might connect. ## On the ExpressionAnalyzer (#21122) relationship Your framing of "orthogonal with overlap" is very convincing to me, but I wonder if the split might be even cleaner than presented. Keeping `StatisticsRequest` column-based (or extending to nested paths like `get_field` where the data source genuinely has the stat) seems like the natural boundary. For the `JOIN ON coalesce(t1.col, '') = t2.col` example you cite, one option would be to request stats on `t1.col` and let `ExpressionAnalyzer` derive how `coalesce` transforms them, rather than pushing expression semantics into the provider, which might have their own very different notion (classic impedance mismatch in DB systems). Your `RequestStatistics` rule already seems to be doing this naturally, since `add_column_refs` extracts base columns, ignoring expression structure, this feels like the right balance. ## Sketches, advanced stats, and the broader statistics infrastructure You mention "room to grow toward sketches, histograms, and selectivity stats later", which reminded me of the `ExtendedStatistics` mechanism introduced in the `StatisticsRegistry` (#21483), with exactly that use-case in mind. The `StatisticsRegistry` provides a way to override/enrich statistics for any operator via customizable `StatisticsProvider` (with existing built-in versions as base). `ExtendedStatistics` is its type-erased extension map that custom providers can populate and other custom ones can consume, without modifying core DataFusion types. It would be nice to think about how the request/response API from this PR could feed custom stats into `ExtendedStatistics` (e.g., a table provider that answers a `StatisticsRequest` with a sketch or histogram could surface that data where `StatisticsProvider` implementations can pick it up during propagation). If `StatisticsContext` (#21815, in review) lands, this could find its natural home there. The longer-term idea is for `StatisticsContext` to also carry the `ExpressionAnalyzer` and `StatisticsRegistry` themselves, so that all statistics infrastructure (collection, expression-level derivation, operator-level propagation, and custom extensions) is accessible from a single place, opening the door for richer stats flowing end-to-end. Happy to discuss in more detail if that would be useful! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
