guykhazma commented on issue #10791: URL: https://github.com/apache/iceberg/issues/10791#issuecomment-2327104016
@huaxingao min/max will not stay accurate, but they still provide valid lower and upper bounds. The issue I am seeing with null counts is that when Spark gets a combination of conditions such as

```
o_orderdate > 5 AND isnotnull(o_orderdate)
```

it estimates the selectivity of the combined predicate by multiplying the selectivity of each individual predicate. The estimation for `isnotnull` does not take the other predicates into account (see [here](https://github.com/apache/spark/blob/48f9cc7e716d7c3568049a80a9af7ca1b5c9ec01/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L235)), and since the row count reflects the already-filtered size, the null count can end up falsely larger than the row count.

As for it being expensive, maybe it is worth benchmarking? Otherwise, we would need a solution for the null handling.
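To make the failure mode concrete, here is a minimal sketch (with hypothetical numbers, not taken from this issue) of the arithmetic Spark-style filter estimation performs: if the null count comes from unfiltered file statistics while the row count has already been reduced by another predicate, the `isnotnull` selectivity estimate can go negative.

```scala
// Hedged sketch of the selectivity arithmetic; all numbers are hypothetical.
object NullCountSelectivitySketch {
  def main(args: Array[String]): Unit = {
    val fileRowCount  = 1000L // total rows in the data files
    val fileNullCount = 300L  // nulls for o_orderdate from file-level metadata

    // Row count after the `o_orderdate > 5` predicate has already been
    // applied; the null count above is still the unfiltered total.
    val filteredRowCount = 200L

    // Spark-style estimate: selectivity(isnotnull) = 1 - nullCount / rowCount.
    // With the unfiltered file row count the estimate is sensible:
    val withFileRowCount = 1.0 - fileNullCount.toDouble / fileRowCount
    println(f"using file row count:     $withFileRowCount%.2f") // 0.70

    // With the filtered row count the null count exceeds the row count,
    // so the estimate becomes negative (i.e. invalid):
    val withFilteredRowCount = 1.0 - fileNullCount.toDouble / filteredRowCount
    println(f"using filtered row count: $withFilteredRowCount%.2f") // -0.50
  }
}
```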