asolimando commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4183577716
> Thanks for sharing the context! happy to take a look.

Thanks @2010YOUY01 for the thoughtful feedback, let me reply in-line to make sure I don't miss anything! In this first reply I will only cover comments strictly related to this PR; the rest is very important, but it's a broader CBO/stats topic, so I will make a second reply just for that.

> ### Questions on this PR
> I don't have prior experience on CBO, and I'm still working through the relevant material. Here are several questions:
>
> * why we assume independence between grouping keys — is this just a simplifying heuristic, or do reference systems intentionally do this because overestimating NDV leads to better end-to-end performance (e.g., for join reordering)?

This is a standard simplifying heuristic in this space, used by both Spark and Trino. In practice, in the presence of correlated columns, the product is an overestimate. Note that we cap the NDV at the number of rows to protect against severe overestimation. DuckDB defaults directly to the number of rows for aggregations ([code](https://github.com/duckdb/duckdb/blob/2bdd7b54b18a99a31f4bdba240cfaede0b7fbee4/src/optimizer/statistics/operator/propagate_aggregate.cpp#L390-L391)), which is an even stronger assumption (one group per input row, only true when the grouping expressions involve a primary key or unique id). So we get a more precise estimate on average, and thanks to NDV capping we are no worse than DuckDB in the worst case. Keeping a main reference system makes sense and would probably make our life easier, but I think it's pragmatic to improve further when there is a good opportunity and we understand the implications, as in this case.

> * why we explicitly account for nulls — this adds implementation complexity, but seems to have only a minor impact

I think the few one-liners handling nulls (`.max(1)` and adding a `+1` when nulls are present) aren't adding much complexity, considering how common nulls are in practice.
Cardinality estimation errors compound quickly, so I think it's reasonable to trade a little extra complexity for higher precision. In general, accounting for nulls in aggregations is pretty common; you can refer to the Spark and Trino references we added in the code comments, and most other systems handle nulls similarly.
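To make the heuristic concrete, here is a minimal sketch of the estimation logic described above: multiply per-key NDVs under the independence assumption, add `1` when nulls are present (the null group counts as one extra distinct value), and cap the result at the input row count. Function and parameter names are illustrative, not DataFusion's actual API.

```rust
/// Hypothetical sketch of the group-count estimate discussed in this thread.
/// `per_column_ndv`: estimated NDV for each grouping key column.
/// `num_rows`: number of input rows; the final cap against overestimation.
/// `any_nulls`: whether any grouping column contains nulls.
fn estimate_group_count(per_column_ndv: &[u64], num_rows: u64, any_nulls: bool) -> u64 {
    // Independence assumption: product of per-key NDVs.
    // `.max(1)` guards against a zero NDV collapsing the product;
    // `saturating_mul` avoids overflow on wide multi-key groupings.
    let product = per_column_ndv
        .iter()
        .fold(1u64, |acc, &ndv| acc.saturating_mul(ndv.max(1)));

    // Nulls form one additional group when present.
    let with_nulls = if any_nulls {
        product.saturating_add(1)
    } else {
        product
    };

    // Cap at the row count: there can never be more groups than input rows.
    with_nulls.min(num_rows)
}

fn main() {
    // Two keys with 10 and 7 distinct values over 1000 rows, nulls present:
    // 10 * 7 + 1 = 71 estimated groups.
    assert_eq!(estimate_group_count(&[10, 7], 1000, true), 71);

    // The product (250000) exceeds the row count, so the cap applies.
    assert_eq!(estimate_group_count(&[500, 500], 1000, false), 1000);
    println!("ok");
}
```

With the cap in place, the worst case degenerates to DuckDB's one-group-per-row assumption, while correlated or low-cardinality keys get a tighter estimate.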
