asolimando commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4183577716
> Thanks for sharing the context! happy to take a look.

Thanks @2010YOUY01 for the thoughtful feedback, let me reply in-line to make sure I don't miss anything! In this first reply I will only cover comments strictly related to this PR; the rest is very important, but it's a broader CBO/stats topic, so I will make a second reply just for that.

> ### Questions on this PR
> I don't have prior experience on CBO, and I'm still working through the relevant material. Here are several questions:
>
> * why we assume independence between grouping keys — is this just a simplifying heuristic, or do reference systems intentionally do this because overestimating NDV leads to better end-to-end performance (e.g., for join reordering)?

This is a standard simplifying heuristic in this space, used by both Spark and Trino. In practice, in the presence of correlated columns, the product is an overestimate. Note that we cap the NDV at the number of rows to protect against severe overestimation. DuckDB defaults directly to the number of rows for aggregations ([code](https://github.com/duckdb/duckdb/blob/2bdd7b54b18a99a31f4bdba240cfaede0b7fbee4/src/optimizer/statistics/operator/propagate_aggregate.cpp#L390-L391)), which is an even stronger assumption (one group per input row, only true when the grouping expressions involve a primary key or unique id). So we get a more precise estimate on average, and thanks to NDV capping we are no worse than DuckDB in the worst case. Keeping a main reference system makes sense and would probably make our life easier, but I think it's pragmatic to improve further when there is a good opportunity and we understand the implications, as in this case.

> * why we explicitly account for nulls — this adds implementation complexity, but seems to have only a minor impact

I think the few one-liners handling nulls (`.max(1)` and adding a `+1` when nulls are present) aren't adding much complexity, considering how common nulls are in practice.
Cardinality estimation errors compound quickly, so I think it's reasonable to trade a little extra complexity for higher precision. In general, accounting for nulls in aggregations is pretty common; you can refer to the Spark and Trino references we added in the code comments, and most other systems handle nulls similarly.
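To make the heuristic concrete, here is a minimal sketch of the estimation logic described above: multiply per-key NDVs under the independence assumption, add `1` when nulls are present (the null group counts as one extra distinct value), and cap the result at the input row count. Function and parameter names are illustrative, not DataFusion's actual API.

```rust
/// Hypothetical sketch of the group-count estimate discussed in this thread.
/// `per_column_ndv`: estimated NDV for each grouping key column.
/// `num_rows`: number of input rows; the final cap against overestimation.
/// `any_nulls`: whether any grouping column contains nulls.
fn estimate_group_count(per_column_ndv: &[u64], num_rows: u64, any_nulls: bool) -> u64 {
    // Independence assumption: product of per-key NDVs.
    // `.max(1)` guards against a zero NDV collapsing the product;
    // `saturating_mul` avoids overflow on wide multi-key groupings.
    let product = per_column_ndv
        .iter()
        .fold(1u64, |acc, &ndv| acc.saturating_mul(ndv.max(1)));

    // Nulls form one additional group when present.
    let with_nulls = if any_nulls {
        product.saturating_add(1)
    } else {
        product
    };

    // Cap at the row count: there can never be more groups than input rows.
    with_nulls.min(num_rows)
}

fn main() {
    // Two keys with 10 and 7 distinct values over 1000 rows, nulls present:
    // 10 * 7 + 1 = 71 estimated groups.
    assert_eq!(estimate_group_count(&[10, 7], 1000, true), 71);

    // The product (250000) exceeds the row count, so the cap applies.
    assert_eq!(estimate_group_count(&[500, 500], 1000, false), 1000);
    println!("ok");
}
```

With the cap in place, the worst case degenerates to DuckDB's one-group-per-row assumption, while correlated or low-cardinality keys get a tighter estimate.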
