asolimando commented on issue #21624:
URL: https://github.com/apache/datafusion/issues/21624#issuecomment-4255264885

   I totally agree @adriangb, this is a real pain point, especially for wide 
tables.
   
   Figuring out which columns need stats and what kind seems doable from the 
plan. Stats requirements are mostly driven by expressions: a range predicate 
(`x > 10`) needs min/max, an equality (`x = 'abc'` or a join condition `a.id = 
b.id`) needs NDV, `IS NULL` needs null_count. The expression shape tells you 
the stats kind. The exception is grouping contexts (GROUP BY, window PARTITION 
BY, DISTINCT), where the operator itself creates a need for NDV on expressions 
that otherwise would not need any stats.
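   The expression-shape-to-stats-kind mapping could be sketched roughly like this (a minimal illustration with made-up `Expr`/`StatsKind` types, not DataFusion's actual expression model):

```rust
// Hypothetical sketch: deriving the stats kind from the expression shape.
// `Expr` and `StatsKind` are simplified stand-ins, not DataFusion types.

#[derive(Debug, PartialEq)]
enum StatsKind {
    MinMax,    // needed by range predicates
    Ndv,       // needed by equalities / join keys
    NullCount, // needed by IS NULL checks
}

// A tiny expression model covering only the cases discussed above.
enum Expr {
    RangePredicate { column: String }, // e.g. x > 10
    Equality { column: String },       // e.g. x = 'abc' or a.id = b.id
    IsNull { column: String },         // e.g. x IS NULL
}

/// Derive the (column, stats kind) requirement implied by an expression.
fn stats_requirement(expr: &Expr) -> (String, StatsKind) {
    match expr {
        Expr::RangePredicate { column } => (column.clone(), StatsKind::MinMax),
        Expr::Equality { column } => (column.clone(), StatsKind::Ndv),
        Expr::IsNull { column } => (column.clone(), StatsKind::NullCount),
    }
}

fn main() {
    let req = stats_requirement(&Expr::RangePredicate { column: "x".into() });
    println!("{:?}", req); // ("x", MinMax)
}
```

   Grouping contexts would need a separate rule keyed on the operator rather than the expression, since the NDV requirement comes from GROUP BY/PARTITION BY/DISTINCT itself.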
   
   One thing to note: CBO and pruning both use column statistics but at 
different granularity. CBO needs table-level aggregates (one NDV per column for 
join ordering), while pruning needs per-file/per-row-group stats (min/max to 
skip files). So the requirements are really (column, stats_kind, granularity).
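   As a strawman, the requirement key could look like this (names are illustrative, not a proposed DataFusion API); note that the same column can legitimately appear at multiple granularities:

```rust
use std::collections::HashSet;

// Hypothetical requirement keyed by (column, stats_kind, granularity).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Granularity {
    Table,    // CBO: one aggregate per column (e.g. NDV for join ordering)
    RowGroup, // pruning: per-file / per-row-group (e.g. min/max to skip files)
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct StatsRequirement {
    column: String,
    kind: String, // e.g. "ndv", "min_max", "null_count"
    granularity: Granularity,
}

fn main() {
    let mut reqs = HashSet::new();
    reqs.insert(StatsRequirement { column: "id".into(), kind: "ndv".into(), granularity: Granularity::Table });
    reqs.insert(StatsRequirement { column: "x".into(), kind: "min_max".into(), granularity: Granularity::RowGroup });
    // The same column may be required at a second granularity as well.
    reqs.insert(StatsRequirement { column: "id".into(), kind: "ndv".into(), granularity: Granularity::RowGroup });
    println!("{} distinct requirements", reqs.len()); // 3
}
```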
   
   Another thing: operators higher in the plan refer to expressions that may pass 
through projections, casts, aliases, etc. To determine what the scan needs to 
collect, you have to trace those expressions back down through the intermediate 
transformations to the source columns. This is the reverse of bottom-up stats 
propagation.
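   In the simplest alias-only case, that reverse tracing amounts to following an alias chain back to its source (a toy sketch; real tracing would also have to walk casts and arbitrary projection expressions):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: follow projection aliases back to the source column.
/// The map goes from an output name to the name it was derived from.
fn resolve_source(name: &str, aliases: &HashMap<String, String>) -> String {
    let mut current = name.to_string();
    // Walk the chain until we reach a name with no further origin: the scan column.
    while let Some(src) = aliases.get(&current) {
        current = src.clone();
    }
    current
}

fn main() {
    // e.g. SELECT a AS b FROM t, then SELECT b AS c higher in the plan:
    let mut aliases = HashMap::new();
    aliases.insert("c".to_string(), "b".to_string());
    aliases.insert("b".to_string(), "a".to_string());
    // A requirement on `c` at the top becomes a requirement on `a` at the scan.
    println!("{}", resolve_source("c", &aliases)); // a
}
```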
   
   I checked `PruningPredicate::required_columns()` that @alamb mentioned, and it 
does seem very relevant. From a cursory look, though, it works at the physical 
level, where the filter already refers to scan columns (essentially exploiting 
filter pushdown).
   
   We would need to generalize the same approach to the full plan (joins, 
aggregates, and other stats-sensitive operators), which would require the 
reverse tracing described above. DataFusion already has a pattern for this kind 
of top-down requirement propagation: `required_input_distribution()` and 
`required_input_ordering()`. A `required_input_statistics()` following the same 
shape could collect and propagate stats requirements down to the scan.
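   To make the idea concrete, here is a rough sketch of what such a hook might look like. Everything here is hypothetical (`PlanNode`, `StatsRequirement`, and the method itself are illustrative, not DataFusion's `ExecutionPlan` API); the point is only the shape: per-child requirement vectors with an empty default, mirroring the distribution/ordering hooks:

```rust
// Hypothetical `required_input_statistics()` hook, mirroring the shape of
// `required_input_distribution()` / `required_input_ordering()`.

#[derive(Debug, Clone, PartialEq)]
struct StatsRequirement {
    column: String,
    kind: String, // "min_max" | "ndv" | "null_count"
}

trait PlanNode {
    fn children(&self) -> Vec<&dyn PlanNode>;

    /// Stats requirements this node imposes on each child; the default is
    /// no requirements, like the existing distribution/ordering hooks.
    fn required_input_statistics(&self) -> Vec<Vec<StatsRequirement>> {
        self.children().iter().map(|_| vec![]).collect()
    }
}

struct ScanNode;

// A filter that needs min/max on its predicate column from its child.
struct FilterNode {
    child: ScanNode,
    predicate_column: String,
}

impl PlanNode for ScanNode {
    fn children(&self) -> Vec<&dyn PlanNode> { vec![] }
}

impl PlanNode for FilterNode {
    fn children(&self) -> Vec<&dyn PlanNode> { vec![&self.child] }

    fn required_input_statistics(&self) -> Vec<Vec<StatsRequirement>> {
        vec![vec![StatsRequirement {
            column: self.predicate_column.clone(),
            kind: "min_max".into(),
        }]]
    }
}

fn main() {
    let filter = FilterNode { child: ScanNode, predicate_column: "x".into() };
    println!("{:?}", filter.required_input_statistics());
}
```

   An optimizer pass could then walk the plan top-down, ask each node for its requirements, apply the alias/projection tracing at each boundary, and hand the accumulated set to the scan.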
   
   The harder part is which files need stats. That depends on runtime behavior (a 
LIMIT query might stop after 2 files out of 100) and likely needs lazy 
collection. These look like two separate problems: column/kind scoping (which 
can be planned) vs file scoping (which is runtime-dependent). I suggest we 
address them separately if possible, as each seems complex enough on its own.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

