[I] Some column statistics are missing after writing data to a table [iceberg-python]

via GitHub Wed, 01 Jan 2025 02:43:56 -0800


rotem-ad opened a new issue, #1482:
URL: https://github.com/apache/iceberg-python/issues/1482


   ### Apache Iceberg version
   
   0.8.1 (latest release)
   
   ### Please describe the bug 🐞
   
   Following this [Slack thread](https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1734704077106409):
   
   Column statistics do not seem to be fully collected when writing data, either with the Arrow DataFrame API or with the `add_files` method.
   
   **Example 1:** Stats after using the `add_files` method (shown using Trino `show stats`):
   ```sql
   show stats for iceberg_test.tests.my_table_add_files;
   
   
   +-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
   |column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
   +-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
   |id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
   |location                 |null     |null                 |null          |null     |null                |null               |
   |probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
   |null                     |null     |null                 |null          |897861060|null                |null               |
   +-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
   ```
   As you can see, the `location` column has no statistics at all when using `add_files`.
   In addition, while calling `add_files` I encountered this error many times (although the operation succeeded):
   ```
   PyArrow statistics missing for column 1 when writing file
   ```
   
   **Example 2:** Stats after using the Arrow DataFrame API to load **the same Parquet files** (shown using Trino `show stats`):
   ```sql
   show stats for iceberg_test.tests.my_table_df;
   
   
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   |column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   |id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
   |location                 |14001826255|null                 |0             |null     |null                |null               |
   |probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
   |null                     |null       |null                 |null          |897861060|null                |null               |
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   ```
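
To make the difference between the two outputs above easier to see, here is a minimal sketch that diffs them. The values are transcribed by hand from the two `show stats` results; the names `ADD_FILES_STATS`, `DATAFRAME_STATS` and `diff_stats` are made up for this illustration and are not part of PyIceberg or Trino.

```python
# Stats transcribed from the two `show stats` outputs above; None stands for null.
# distinct_values_count and the per-column row_count are null in both tables,
# so they are left out here.
ADD_FILES_STATS = {
    "id": {"data_size": None, "nulls_fraction": 0,
           "low_value": "-9223372031483295744", "high_value": "9223372014179617792"},
    "location": {"data_size": None, "nulls_fraction": None,
                 "low_value": None, "high_value": None},
    "probability": {"data_size": None, "nulls_fraction": 0,
                    "low_value": "0.2500010132789612", "high_value": "1.0"},
}

DATAFRAME_STATS = {
    "id": {"data_size": None, "nulls_fraction": 0,
           "low_value": "-9223372031483295744", "high_value": "9223372014179617792"},
    "location": {"data_size": 14001826255, "nulls_fraction": 0,
                 "low_value": None, "high_value": None},
    "probability": {"data_size": None, "nulls_fraction": 0,
                    "low_value": "0.2500010132789612", "high_value": "1.0"},
}


def diff_stats(left, right):
    """Return {column: {stat: (left_value, right_value)}} for every stat that differs."""
    diff = {}
    for column in sorted(left.keys() | right.keys()):
        left_col = left.get(column, {})
        right_col = right.get(column, {})
        changed = {
            stat: (left_col.get(stat), right_col.get(stat))
            for stat in sorted(left_col.keys() | right_col.keys())
            if left_col.get(stat) != right_col.get(stat)
        }
        if changed:
            diff[column] = changed
    return diff


if __name__ == "__main__":
    for column, changed in diff_stats(ADD_FILES_STATS, DATAFRAME_STATS).items():
        for stat, (a, b) in changed.items():
            print(f"{column}.{stat}: add_files={a!r}, dataframe={b!r}")
```

Only `location` differs: `add_files` leaves its `data_size` and `nulls_fraction` null, while the DataFrame path fills both in. Neither path produces `low_value`/`high_value` for `location`.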
   
   On the other hand, after collecting table statistics using Trino, the column statistics look more complete:
   ```sql
   analyze iceberg_test.tests.my_table_df;
   
   show stats for iceberg_test.tests.my_table_df;
   
   
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   |column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   |id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
   |location                 |14001826255|15973993             |0             |null     |null                |null               |
   |probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
   |null                     |null       |null                 |null          |897861060|null                |null               |
   +-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
   ```
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [X] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
