asolimando commented on issue #21624: URL: https://github.com/apache/datafusion/issues/21624#issuecomment-4255264885
I totally agree @adriangb, this is a real pain point, especially for wide tables. Figuring out which columns need stats, and what kind, seems doable from the plan.

Stats requirements are mostly driven by expressions: a range predicate (`x > 10`) needs min/max, an equality (`x = 'abc'` or a join condition `a.id = b.id`) needs NDV, `IS NULL` needs null_count. The expression shape tells you the stats kind. The exception is grouping contexts (GROUP BY, window PARTITION BY, DISTINCT), where the operator itself creates a need for NDV on expressions that would otherwise not need any stats.

One thing to note: CBO and pruning both use column statistics, but at different granularities. CBO needs table-level aggregates (one NDV per column for join ordering), while pruning needs per-file/per-row-group stats (min/max to skip files). So the requirements are really `(column, stats_kind, granularity)` triples (see the first sketch at the end of this comment).

Another thing: operators higher in the plan refer to expressions that may pass through projections, casts, aliases, etc. To find out what the scan needs to collect, you have to trace those expressions back down through those transformations to the source columns. This is the reverse of bottom-up stats propagation.

I checked `PruningPredicate::required_columns()` that @alamb mentioned, and it does seem very relevant, but from a cursory look it works at the physical level, where the filter already refers to scan columns (basically exploiting filter pushdown). We would need to generalize the same approach to the full plan (joins, aggregates, and other stats-sensitive operators), which would require the reverse tracing described above.

DataFusion already has a pattern for this kind of top-down requirement propagation: `required_input_distribution()` and `required_input_ordering()`. A `required_input_statistics()` following the same shape could collect and propagate stats requirements down to the scan (see the second sketch at the end of this comment).

The harder part is deciding which files need stats. That depends on runtime behavior (a LIMIT query might stop after 2 files out of 100) and likely needs lazy collection. These seem like two separate problems: column/kind scoping (can be planned) vs. file scoping (runtime). I suggest we address them separately if possible, as each seems complex enough on its own.
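To make the `(column, stats_kind, granularity)` idea concrete, here is a minimal Rust sketch. All names in it are mine and hypothetical (nothing here is an existing DataFusion type or API); it just shows how the expression shape picks the stats kind:

```rust
/// Hypothetical: which statistic is needed. None of these types exist
/// in DataFusion today.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum StatsKind {
    MinMax,    // range predicates like `x > 10`
    Ndv,       // equality/join keys, GROUP BY / PARTITION BY / DISTINCT
    NullCount, // `IS NULL` / `IS NOT NULL`
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Granularity {
    Table,    // CBO: one aggregate per column (e.g. NDV for join ordering)
    Fragment, // pruning: per file / per row group (e.g. min/max to skip files)
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct StatsRequirement {
    /// Source column, after tracing back through projections/casts/aliases.
    pub column: String,
    pub kind: StatsKind,
    pub granularity: Granularity,
}

/// Toy expression shape, just enough to show the mapping.
pub enum Expr {
    Gt(String, i64),        // x > 10      -> min/max
    Eq(String, String),     // x = 'abc'   -> NDV
    JoinOn(String, String), // a.id = b.id -> NDV on both sides
    IsNull(String),         // x IS NULL   -> null_count
}

pub fn stats_for(expr: &Expr, g: Granularity) -> Vec<StatsRequirement> {
    let req = |col: &str, kind| StatsRequirement {
        column: col.to_string(),
        kind,
        granularity: g,
    };
    match expr {
        Expr::Gt(col, _) => vec![req(col, StatsKind::MinMax)],
        Expr::Eq(col, _) => vec![req(col, StatsKind::Ndv)],
        Expr::JoinOn(l, r) => vec![req(l, StatsKind::Ndv), req(r, StatsKind::Ndv)],
        Expr::IsNull(col) => vec![req(col, StatsKind::NullCount)],
    }
}
```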
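And a rough shape for the top-down hook itself, mirroring `required_input_distribution()` / `required_input_ordering()`. Again purely illustrative (it reuses the hypothetical `StatsRequirement` from the sketch above), not a proposal for the exact signature:

```rust
/// Hypothetical companion to required_input_distribution() /
/// required_input_ordering(): each operator reports, per child, the
/// statistics it needs from below. A planner pass would walk the plan
/// top-down, rewrite incoming requirements through projections/aliases
/// into each child's schema (the "reverse tracing" above), merge them
/// with the operator's own needs, and hand the final set to the scan.
pub trait StatsRequirementPropagation {
    /// Requirements this operator adds itself, one Vec per child
    /// (e.g. a hash join adds NDV on both join keys).
    fn required_input_statistics(&self) -> Vec<Vec<StatsRequirement>>;

    /// Rewrite requirements arriving from the parent into this operator's
    /// input schemas; requirements on columns this operator computes
    /// (and that cannot be traced to a source column) get dropped here.
    fn propagate_statistics_requirements(
        &self,
        from_parent: &[StatsRequirement],
    ) -> Vec<Vec<StatsRequirement>>;
}
```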
