guptaakashdeep opened a new issue, #1884: URL: https://github.com/apache/iceberg-python/issues/1884
### Apache Iceberg version

0.9.0 (latest release)

### Please describe the bug 🐞

## Issue: `table.inspect.entries()` fails when the table is a MOR table with delete files present

The Iceberg MOR table is created via Apache Spark 3.5.0 with Iceberg 1.5.0, and it is read via PyIceberg 0.9.0 using `StaticTable.from_metadata()`.

### Stacktrace:

```bash
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 table.inspect.entries()

File ~/Documents/project-repos/git-repos/lakehouse-health-analyzer/venv/lib/python3.12/site-packages/pyiceberg/table/inspect.py:208, in InspectTable.entries(self, snapshot_id)
    188 partition = entry.data_file.partition
    189 partition_record_dict = {
    190     field.name: partition[pos]
    191     for pos, field in enumerate(self.tbl.metadata.specs()[manifest.partition_spec_id].fields)
    192 }
    194 entries.append(
    195     {
    196         "status": entry.status.value,
    197         "snapshot_id": entry.snapshot_id,
    198         "sequence_number": entry.sequence_number,
    199         "file_sequence_number": entry.file_sequence_number,
    200         "data_file": {
    201             "content": entry.data_file.content,
    202             "file_path": entry.data_file.file_path,
    203             "file_format": entry.data_file.file_format,
    204             "partition": partition_record_dict,
    205             "record_count": entry.data_file.record_count,
    206             "file_size_in_bytes": entry.data_file.file_size_in_bytes,
    207             "column_sizes": dict(entry.data_file.column_sizes),
--> 208             "value_counts": dict(entry.data_file.value_counts),
    209             "null_value_counts": dict(entry.data_file.null_value_counts),
    210             "nan_value_counts": dict(entry.data_file.nan_value_counts),
    211             "lower_bounds": entry.data_file.lower_bounds,
    212             "upper_bounds": entry.data_file.upper_bounds,
    213             "key_metadata": entry.data_file.key_metadata,
    214             "split_offsets": entry.data_file.split_offsets,
    215             "equality_ids": entry.data_file.equality_ids,
    216             "sort_order_id": entry.data_file.sort_order_id,
    217             "spec_id": entry.data_file.spec_id,
    218         },
    219         "readable_metrics": readable_metrics,
    220     }
    221 )

    223 return pa.Table.from_pylist(
    224     entries,
    225     schema=entries_schema,
    226 )

TypeError: 'NoneType' object is not iterable
```

## Replication

This issue can be replicated by following the instructions below:

1. Create an Iceberg MOR table using Spark 3.5.0 with Iceberg 1.5.0:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, array, rand

DW_PATH = '../warehouse'

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("iceberg-mor-test") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.apache.spark:spark-avro_2.12:3.5.0') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type', 'hadoop') \
    .config('spark.sql.catalog.local.warehouse', DW_PATH) \
    .getOrCreate()

t1 = spark.range(10000).withColumn("year", lit(2023))
t1 = t1.withColumn("business_vertical",
                   array(lit("Retail"), lit("SME"), lit("Cor"), lit("Analytics"))
                   .getItem((rand() * 4).cast("int")))

t1.coalesce(1).writeTo('local.db.pyic_mor_test').partitionedBy('year').using('iceberg') \
    .tableProperty('format-version', '2') \
    .tableProperty('write.delete.mode', 'merge-on-read') \
    .tableProperty('write.update.mode', 'merge-on-read') \
    .tableProperty('write.merge.mode', 'merge-on-read') \
    .create()
```
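For reference, the step-1 table definition can equivalently be expressed in SQL — a sketch only, assuming the same `local` catalog (unlike the `writeTo()` call above, this creates the table empty, so the 10,000 rows would still need to be inserted):

```python
# Hypothetical SQL equivalent of the step-1 table definition (sketch only).
# The MOR behaviour comes from format v2 plus the three write.*.mode
# properties; unlike writeTo().create() above, no rows are loaded here.
spark.sql("""
    CREATE TABLE local.db.pyic_mor_test (
        id BIGINT,
        year INT,
        business_vertical STRING)
    USING iceberg
    PARTITIONED BY (year)
    TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read')
""")
```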
2. Check the table properties to make sure the table is MOR:

```python
spark.sql("SHOW TBLPROPERTIES local.db.pyic_mor_test").show(truncate=False)
```

```sql
+-------------------------------+-------------------+
|key                            |value              |
+-------------------------------+-------------------+
|current-snapshot-id            |2543645387796664537|
|format                         |iceberg/parquet    |
|format-version                 |2                  |
|write.delete.mode              |merge-on-read      |
|write.merge.mode               |merge-on-read      |
|write.parquet.compression-codec|zstd               |
|write.update.mode              |merge-on-read      |
+-------------------------------+-------------------+
```

3. Run an `UPDATE` statement to generate a delete file:

```python
spark.sql("UPDATE local.db.pyic_mor_test SET business_vertical = 'DataEngineering' WHERE id > 7000")
```

4. Check that a delete file was generated:

```python
spark.table("local.db.pyic_mor_test.delete_files").show()
```

```sql
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|content|           file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|        column_sizes|value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|      1|/Users/akashdeepg...|    PARQUET|      0|   {2023}|        2999|              4878|{2147483546 -> 21...|        NULL|             NULL|            NULL|{2147483546 -> [2...|{2147483546 -> [2...|        NULL|          [4]|        NULL|         NULL|{{NULL, NULL, NUL...|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
```

### Reading the Spark-created table from PyIceberg

```python
from pyiceberg.table import StaticTable

# Using the latest metadata.json path
metadata_path = "./warehouse/db/pyic_mor_test/metadata/v2.metadata.json"
table = StaticTable.from_metadata(metadata_path)

# This will break with the stacktrace provided above
table.inspect.entries()
```

## Issue found after debugging

I did some debugging and figured out that `inspect.entries()` breaks for MOR tables while reading the `*-delete.parquet` files present in the table. While reading the delete file, `value_counts` comes back as null. I can see that the `ManifestEntryStatus` is `ADDED` and the `DataFile` content is `DataFileContent.POSITION_DELETES`, which seems to be correct. I looked further into the `manifest.avro` file that holds the entry for the delete parquet files, and the `value_counts` populated there is itself `NULL`. That is why `entry.data_file.value_counts` comes back as `null`. The `NULL` `value_counts` can also be seen above in the output of the `delete_files` table query. A sketch of the kind of guard that would avoid this is at the end of this report.

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
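For completeness, a minimal sketch of the kind of guard that would avoid the `TypeError`, assuming (as the manifest contents above suggest) that the metrics maps can legitimately be absent for delete-file entries. The helper name `_to_dict` is illustrative, not an existing PyIceberg function:

```python
from typing import Mapping, Optional


def _to_dict(metrics: Optional[Mapping[int, int]]) -> dict:
    """Coerce a possibly-absent metrics map to a plain dict.

    For positional-delete files the manifest can hold NULL for
    value_counts / null_value_counts / nan_value_counts, so treat
    None as an empty mapping instead of passing it straight to dict().
    """
    return dict(metrics) if metrics is not None else {}


# dict(None) is exactly the failure shown in the stack trace:
#   TypeError: 'NoneType' object is not iterable
assert _to_dict(None) == {}
assert _to_dict({1: 2999}) == {1: 2999}
```

Inside `InspectTable.entries()` the failing fields would then read, e.g., `"value_counts": _to_dict(entry.data_file.value_counts)`, and likewise for `column_sizes`, `null_value_counts`, and `nan_value_counts`.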