asolimando commented on code in PR #21081:
URL: https://github.com/apache/datafusion/pull/21081#discussion_r3021756638
##########
datafusion/common/src/stats.rs:
##########
@@ -551,6 +551,10 @@ impl Statistics {
}
Precision::Absent => Precision::Absent,
};
+ // NDV can never exceed the number of rows
+ if let Some(&rows) = self.num_rows.get_value() {
+ cs.distinct_count = cs.distinct_count.min(&Precision::Inexact(rows));
Review Comment:
Thanks for your comments, @gene-bordegaray!
All comments share the same concern, so I will address them together.
NDV being less than or equal to the number of rows is an invariant that must
always hold: you can never have more distinct values than rows, regardless of
the precision level.
`Inexact(N)` means "at most N" in DataFusion's semantics for statistics. If
`num_rows` is `Inexact(10)`, there are at most 10 rows, hence at most 10
distinct values. Capping NDV at `Inexact(10)` is therefore correct: any value
higher than 10 would make the statistics incoherent.
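
To make the capping rule concrete, here is a minimal, self-contained sketch. It
assumes a simplified stand-in for `Precision` (fixed to `usize`, with a `min`
that is exact only when both inputs are exact); it is not the real DataFusion
type, and `cap_ndv` is a hypothetical helper name used only for illustration.

```rust
/// Simplified stand-in for DataFusion's `Precision<usize>` (an assumption,
/// not the real type from `datafusion_common`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the underlying value, if any.
    fn get_value(&self) -> Option<&usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }

    /// Minimum of two estimates. The result is `Exact` only when both
    /// inputs are `Exact`, and `Absent` when either input is `Absent`.
    fn min(&self, other: &Precision) -> Precision {
        match (self.get_value(), other.get_value()) {
            (Some(a), Some(b)) => {
                let m = (*a).min(*b);
                if matches!((self, other), (Precision::Exact(_), Precision::Exact(_))) {
                    Precision::Exact(m)
                } else {
                    Precision::Inexact(m)
                }
            }
            _ => Precision::Absent,
        }
    }
}

/// Hypothetical helper: caps the distinct count (NDV) at the row count.
/// When the row count is unknown, the NDV is left untouched.
fn cap_ndv(distinct_count: Precision, num_rows: Precision) -> Precision {
    match num_rows.get_value() {
        Some(&rows) => distinct_count.min(&Precision::Inexact(rows)),
        None => distinct_count,
    }
}
```

Under these assumptions, `cap_ndv(Precision::Inexact(50), Precision::Inexact(10))`
yields `Precision::Inexact(10)`, while an `Absent` NDV stays `Absent`. Note that
in this sketch an `Exact` NDV comes back as `Inexact`, because the cap itself is
passed as `Inexact(rows)`.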
You are right to ask how this interacts with `with_fetch`. As suggested, a
test is needed to clarify this point; I have added one in
f43b07163cd0c628721d0ed739d490ec308f67e4, showing that the cap does not
interfere with `with_fetch`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]