rotem-ad opened a new issue, #1482: URL: https://github.com/apache/iceberg-python/issues/1482
### Apache Iceberg version

0.8.1 (latest release)

### Please describe the bug 🐞

Following this [Slack thread](https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1734704077106409):

It seems that column statistics are not fully collected when writing data, either with the Arrow Dataframe API or with the `add_files` method.

**Example 1:**
Stats after using the `add_files` method (shown using Trino `show stats`):

```sql
show stats for iceberg_test.tests.my_table_add_files;

+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |null     |null                 |null          |null     |null                |null               |
|probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null     |null                 |null          |897861060|null                |null               |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
```

As you can see, the `location` column is missing statistics when using `add_files`.
In addition, while calling `add_files` I encountered this error many times (but the operation succeeded):

```
PyArrow statistics missing for column 1 when writing file
```

**Example 2:**
Stats after using the Arrow Dataframe API to load **the same Parquet files** (shown using Trino `show stats`):

```sql
show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|null                 |0             |null     |null                |null               |
|probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
```

On the other hand, after collecting table statistics using Trino, the column statistics look more complete:

```sql
analyze iceberg_test.tests.my_table_df;

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|15973993             |0             |null     |null                |null               |
|probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
```

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [X] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org