guykhazma commented on issue #10791:
URL: https://github.com/apache/iceberg/issues/10791#issuecomment-2326424328

@huaxingao @karuppayya @jeesou @aokolnychyi @alexjo2144 @findepi @manishmalhotrawork
Continuing the [discussion from the mailing list](https://lists.apache.org/thread/6kyvp5xk5g46325ztvzxx3jn7q99cc1o) about whether to collect these statistics at run time here, since my mail doesn't appear in the mailing list for some reason.
   
I wanted to revisit the discussion about using partition stats for min/max and null counts. It seems we need to compute the null count at query time in any case: during manifest scanning, some data files may be filtered out by the query predicates, so a statically collected null count can end up larger than the number of rows actually scanned for a partition or table. In that case, Spark may incorrectly estimate zero rows when an `isNotNull` predicate is used.
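For intuition: Spark's CBO estimates the selectivity of an `isNotNull` filter as roughly `1 - nullCount / rowCount`, so a statically collected null count of, say, 100 against a post-pruning row count of 60 clamps the estimate to zero rows. Below is a minimal sketch of aggregating null counts from only the surviving data files via Iceberg's scan planning API (the class and method names are hypothetical, not a proposed API):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

class RuntimeNullCounts {
  // Sum per-column null counts over only the data files that survive
  // manifest filtering for this query's predicate.
  static Map<Integer, Long> collect(Table table, Expression filter) throws IOException {
    Map<Integer, Long> nullCounts = new HashMap<>(); // field id -> null count
    try (CloseableIterable<FileScanTask> tasks = table.newScan().filter(filter).planFiles()) {
      for (FileScanTask task : tasks) {
        Map<Integer, Long> fileNulls = task.file().nullValueCounts();
        if (fileNulls != null) {
          fileNulls.forEach((fieldId, count) -> nullCounts.merge(fieldId, count, Long::sum));
        }
      }
    }
    return nullCounts;
  }
}
```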
   
However, min/max values can still be pre-computed at the partition level: the rows that survive file pruning are a subset of the partition, so its bounds remain valid (if possibly loose) lower and upper bounds even with additional filtering.
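Concretely, the two sources could be combined when reporting column stats to Spark. A minimal sketch against Spark's DSv2 `ColumnStatistics` interface (Spark 3.4+, which has empty defaults for the remaining methods); the class name and fields are hypothetical:

```java
import java.util.Optional;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Sketch: report precomputed partition-level bounds to Spark together
// with a null count aggregated at query time from surviving files.
class HybridColumnStats implements ColumnStatistics {
  private final Object lowerBound;      // from partition stats, collected statically
  private final Object upperBound;      // from partition stats, collected statically
  private final long runtimeNullCount;  // summed over files that survive pruning

  HybridColumnStats(Object lowerBound, Object upperBound, long runtimeNullCount) {
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
    this.runtimeNullCount = runtimeNullCount;
  }

  @Override
  public Optional<Object> min() {
    return Optional.ofNullable(lowerBound);
  }

  @Override
  public Optional<Object> max() {
    return Optional.ofNullable(upperBound);
  }

  @Override
  public OptionalLong nullCount() {
    return OptionalLong.of(runtimeNullCount);
  }
}
```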
   
Any thoughts? If collecting null counts (and possibly min/max values) on the fly seems reasonable, I can open a PR to implement it.
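If min/max were also collected on the fly (tighter than the partition-level bounds, at some planning cost), a rough sketch for a column's lower bound could look like the following; the class and method names are hypothetical. Note that Iceberg's file-level bounds may be truncated (e.g. string prefixes), so the merged value is a valid lower bound rather than an exact minimum:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.Map;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;

class RuntimeBounds {
  // Merge a column's file-level lower bounds over only the surviving files.
  static Object lowerBound(Table table, int fieldId, Type.PrimitiveType type, Expression filter)
      throws IOException {
    Comparator<Object> cmp = Comparators.forType(type);
    Object min = null;
    try (CloseableIterable<FileScanTask> tasks = table.newScan().filter(filter).planFiles()) {
      for (FileScanTask task : tasks) {
        Map<Integer, ByteBuffer> bounds = task.file().lowerBounds();
        ByteBuffer buffer = bounds == null ? null : bounds.get(fieldId);
        if (buffer != null) {
          Object value = Conversions.fromByteBuffer(type, buffer);
          if (min == null || cmp.compare(value, min) < 0) {
            min = value;
          }
        }
      }
    }
    return min; // null if no surviving file carried a bound for this column
  }
}
```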
   