guykhazma commented on issue #10791: URL: https://github.com/apache/iceberg/issues/10791#issuecomment-2327104016
@huaxingao min/max will not stay accurate, but they still provide valid lower and upper bounds. The issue I am seeing with null counts is that when Spark gets a combination of conditions such as

```
o_orderdate > 5 AND isnotnull(o_orderdate)
```

it estimates the selectivity of the combined predicate by multiplying the selectivity of each individual predicate. The estimation for `isnotnull` does not take the other predicates into account (see [here](https://github.com/apache/spark/blob/48f9cc7e716d7c3568049a80a9af7ca1b5c9ec01/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L235)), and since the row count reflects the already-filtered size, the null count can end up falsely larger than the row count.

As for it being expensive, maybe it is worth benchmarking? Otherwise, we would need a solution for the null handling.
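To make the failure mode concrete, here is a minimal sketch (with hypothetical numbers, not taken from this issue) of the arithmetic Spark-style filter estimation performs: if the null count comes from unfiltered file statistics while the row count has already been reduced by another predicate, the `isnotnull` selectivity estimate can go negative.

```scala
// Hedged sketch of the selectivity arithmetic; all numbers are hypothetical.
object NullCountSelectivitySketch {
  def main(args: Array[String]): Unit = {
    val fileRowCount  = 1000L // total rows in the data files
    val fileNullCount = 300L  // nulls for o_orderdate from file-level metadata

    // Row count after the `o_orderdate > 5` predicate has already been
    // applied; the null count above is still the unfiltered total.
    val filteredRowCount = 200L

    // Spark-style estimate: selectivity(isnotnull) = 1 - nullCount / rowCount.
    // With the unfiltered file row count the estimate is sensible:
    val withFileRowCount = 1.0 - fileNullCount.toDouble / fileRowCount
    println(f"using file row count:     $withFileRowCount%.2f") // 0.70

    // With the filtered row count the null count exceeds the row count,
    // so the estimate becomes negative (i.e. invalid):
    val withFilteredRowCount = 1.0 - fileNullCount.toDouble / filteredRowCount
    println(f"using filtered row count: $withFilteredRowCount%.2f") // -0.50
  }
}
```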