Re: [I] Use Min, Max, and NumOfNulls from Manifest Files for Spark Column Stats [iceberg]

via GitHub Tue, 03 Sep 2024 09:49:15 -0700


huaxingao commented on issue #10791:
URL: https://github.com/apache/iceberg/issues/10791#issuecomment-2326996054


   If data files are filtered out by the query predicate, the pushed-down
   min/max values and null counts are no longer accurate. Spark does take
   filter estimation into account when calculating stats for the CBO, but I
   am not sure how accurate that is. Computing these stats on the fly is
   expensive.
   
   Huaxin
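
For illustration only (this is not code from the thread, from Iceberg, or from Spark): a minimal Python sketch of why a statically collected null count can disagree with the row count once data files are pruned. `FileStats` and `not_null_rows` are hypothetical names, and the isNotNull estimate is a simplified model that clamps selectivity to [0, 1]; Spark's actual filter estimation may behave differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-data-file column stats, loosely modeled on a manifest entry."""
    record_count: int
    null_count: int
    lower: int
    upper: int


def not_null_rows(row_count: int, null_count: int) -> int:
    """Simplified isNotNull estimate (an assumption, not Spark's code):
    rows * (1 - nulls / rows), with the selectivity clamped to [0, 1]."""
    if row_count == 0:
        return 0
    selectivity = 1.0 - null_count / row_count
    return round(row_count * min(1.0, max(0.0, selectivity)))


files = [
    FileStats(record_count=200, null_count=150, lower=0, upper=9),
    FileStats(record_count=100, null_count=0, lower=10, upper=19),
]

# Null count collected statically, before any query predicate is known.
static_nulls = sum(f.null_count for f in files)

# The predicate "col >= 10" prunes the first file via its upper bound.
surviving = [f for f in files if f.upper >= 10]
rows_after_pruning = sum(f.record_count for f in surviving)
nulls_after_pruning = sum(f.null_count for f in surviving)

# Stale null count (150) exceeds the surviving row count (100).
stale_estimate = not_null_rows(rows_after_pruning, static_nulls)
# Null count recomputed from the surviving files only.
fresh_estimate = not_null_rows(rows_after_pruning, nulls_after_pruning)

print(stale_estimate, fresh_estimate)
```

Under these assumptions the stale null count drives the isNotNull estimate to zero rows, while the count recomputed from the surviving files keeps all 100 rows.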
   
   On Tue, Sep 3, 2024 at 5:41 AM Guy Khazma ***@***.***> wrote:
   
   > @huaxingao <https://github.com/huaxingao> @karuppayya
   > <https://github.com/karuppayya> @jeesou <https://github.com/jeesou>
   > @aokolnychyi <https://github.com/aokolnychyi> @alexjo2144
   > <https://github.com/alexjo2144> @findepi <https://github.com/findepi>
   > @manishmalhotrawork <https://github.com/manishmalhotrawork>
   > Continuing here the discussion from the mailing list
   > <https://lists.apache.org/thread/6kyvp5xk5g46325ztvzxx3jn7q99cc1o> about
   > whether to collect the statistics at run time, since my mail doesn't
   > appear on the mailing list for some reason.
   >
   > I wanted to revisit the discussion about using partition stats for min/max
   > and null counts. It seems we might need to compute the null count at query
   > time in any case. This is because, during manifest scanning, some data
   > files may be filtered out based on query predicates. This could lead to a
   > situation where the number of rows is less than the number of nulls for a
   > partition or table if these counts are collected statically. In such cases,
   > Spark might incorrectly estimate zero rows if an isNotNull predicate is
   > used.
   >
   > However, min/max values can still be pre-computed at the partition level,
   > as they remain valid as lower and upper bounds even with additional
   > filtering.
   >
   > Any thoughts? If collecting null counts (and possibly min/max values) on
   > the fly seems reasonable, I can open a PR to implement it.
   >
   > View it on GitHub:
   > <https://github.com/apache/iceberg/issues/10791#issuecomment-2326424328>
   >
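
For illustration only (not code from the thread or from Iceberg): a small Python sketch of the min/max argument in the quoted comment. Bounds merged over every file in a partition stay valid, if looser, lower and upper bounds for any subset of files that survives pruning. `FileBounds` and `merge_bounds` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FileBounds:
    """Hypothetical per-file lower/upper bounds for one column."""
    lower: int
    upper: int


def merge_bounds(files: Iterable[FileBounds]) -> Tuple[int, int]:
    """Combine per-file bounds into a single (min, max) pair."""
    files = list(files)
    return min(f.lower for f in files), max(f.upper for f in files)


partition_files = [FileBounds(0, 9), FileBounds(10, 19), FileBounds(20, 29)]

# Pre-computed once for the whole partition, independent of any query.
precomputed = merge_bounds(partition_files)

# The predicate "col >= 15" prunes the first file via its upper bound.
surviving = [f for f in partition_files if f.upper >= 15]
on_the_fly = merge_bounds(surviving)

# The pre-computed range still encloses the range of the surviving files.
still_valid = precomputed[0] <= on_the_fly[0] and on_the_fly[1] <= precomputed[1]

print(precomputed, on_the_fly, still_valid)
```

The pre-computed range is wider than the one recomputed after pruning, so it is less precise but never wrong as a bound, which is the property the quoted comment relies on.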
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

