[I] Column Stats reading issue for puffin file generated through Presto [iceberg]

via GitHub Wed, 28 Aug 2024 00:43:23 -0700


jeesou opened a new issue, #11034:
URL: https://github.com/apache/iceberg/issues/11034


   ### Apache Iceberg version
   
   1.6.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   As we work with multiple engines like Spark, Presto, Flink, etc., 
interoperability is crucial.
   
   Recently, PRs \[#10288\](https://github.com/apache/iceberg/pull/10288) and 
\[#10659\](https://github.com/apache/iceberg/pull/10659) added Columnar stats 
support, specifically NDV (Number of Distinct Values) for Spark. Based on this, 
we can generate the Puffin (\`.stat\`) file, which can be used by Spark for 
better query planning.
   
   The Puffin file generation logic was already developed internally by Presto, 
which creates the Puffin file when running the \`ANALYZE\` query.
   
   According to the interoperability and Puffin specification, any Puffin file 
generated by one engine should be usable by another engine. However, we 
encountered issues with Spark. The Puffin file generated by Presto contains 
another \`blob-metadata\` entry, i.e., \`"data\_size"\` of type 
\`"presto-sum-data-size-bytes-v1"\`, while \`"ndv"\` is of type 
\`"apache-datasketches-theta-v1"\`.
   
   When we used the Puffin file generated by Presto in Spark, the \`"ndv"\` 
stats were being overwritten to \`null\` for columns where the \`"data\_size"\` 
blob was present, particularly for string type fields.
   
   To resolve this, we need to ensure that Puffin files generated by Presto can 
be correctly utilized by Spark.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[I] Column Stats reading issue for puffin file generated through Presto [iceberg]

Reply via email to