asolimando commented on code in PR #21081:
URL: https://github.com/apache/datafusion/pull/21081#discussion_r3066573862
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -708,7 +708,13 @@ fn max_distinct_count(
stats: &ColumnStatistics,
) -> Precision<usize> {
match &stats.distinct_count {
- &dc @ (Precision::Exact(_) | Precision::Inexact(_)) => dc,
+ &dc @ (Precision::Exact(_) | Precision::Inexact(_)) => {
+ // NDV can never exceed the number of rows
+ match num_rows {
+ Precision::Absent => dc,
+ _ => dc.min(num_rows).to_inexact(),
Review Comment:
You are right, and there is actually a case where `Exact` should be
preserved: it's when we don't cap and both are `Exact` (that is `Exact(ndv/dc)`
< `Exact(num_rows)`), all the rest is fine as `Inexact`, but this case was
mishandled.
This made me realize that test coverage should be improved around `Exact`,
so I added more tests covering all combinations (with and without capping) in
840c6ceaa.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]