guptaakashdeep opened a new issue, #1884:
URL: https://github.com/apache/iceberg-python/issues/1884

   ### Apache Iceberg version
   
   0.9.0 (latest release)
   
   ### Please describe the bug 🐞
   
   ## Issue:
   `table.inspect.entries()` fails when the table is a merge-on-read (MOR) table and has delete files present in it. The Iceberg MOR table is created via Apache Spark 3.5.0 with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using `StaticTable.from_metadata()`.
   
   ### Stacktrace:
   ```bash
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   Cell In[2], line 1
   ----> 1 table.inspect.entries()
   
   File ~/Documents/project-repos/git-repos/lakehouse-health-analyzer/venv/lib/python3.12/site-packages/pyiceberg/table/inspect.py:208, in InspectTable.entries(self, snapshot_id)
       188         partition = entry.data_file.partition
       189         partition_record_dict = {
       190             field.name: partition[pos]
       191             for pos, field in enumerate(self.tbl.metadata.specs()[manifest.partition_spec_id].fields)
       192         }
       194         entries.append(
       195             {
       196                 "status": entry.status.value,
       197                 "snapshot_id": entry.snapshot_id,
       198                 "sequence_number": entry.sequence_number,
       199                 "file_sequence_number": entry.file_sequence_number,
       200                 "data_file": {
       201                     "content": entry.data_file.content,
       202                     "file_path": entry.data_file.file_path,
       203                     "file_format": entry.data_file.file_format,
       204                     "partition": partition_record_dict,
       205                     "record_count": entry.data_file.record_count,
       206                     "file_size_in_bytes": entry.data_file.file_size_in_bytes,
       207                     "column_sizes": dict(entry.data_file.column_sizes),
   --> 208                     "value_counts": dict(entry.data_file.value_counts),
       209                     "null_value_counts": dict(entry.data_file.null_value_counts),
       210                     "nan_value_counts": dict(entry.data_file.nan_value_counts),
       211                     "lower_bounds": entry.data_file.lower_bounds,
       212                     "upper_bounds": entry.data_file.upper_bounds,
       213                     "key_metadata": entry.data_file.key_metadata,
       214                     "split_offsets": entry.data_file.split_offsets,
       215                     "equality_ids": entry.data_file.equality_ids,
       216                     "sort_order_id": entry.data_file.sort_order_id,
       217                     "spec_id": entry.data_file.spec_id,
       218                 },
       219                 "readable_metrics": readable_metrics,
       220             }
       221         )
       223 return pa.Table.from_pylist(
       224     entries,
       225     schema=entries_schema,
       226 )
   
   TypeError: 'NoneType' object is not iterable
   ```
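   The failure boils down to calling `dict()` on a metric map that is `None` for the delete-file entry. A minimal, self-contained illustration (hypothetical values, not PyIceberg code):

   ```python
   value_counts = None  # what the delete-file manifest entry carries for this field

   try:
       dict(value_counts)  # the same call made on inspect.py line 208
   except TypeError as e:
       print(e)  # 'NoneType' object is not iterable

   print(dict(value_counts or {}))  # a None-safe conversion would return {}
   ```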
   
   ## Replication
   This issue can be replicated by following the instructions below:
   
   1. Create an Iceberg MOR table using `Spark 3.5.0` with `Iceberg 1.5.0`
   
   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import lit, array, rand
   
   DW_PATH='../warehouse'
   spark = SparkSession.builder \
       .master("local[4]") \
       .appName("iceberg-mor-test") \
       .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.apache.spark:spark-avro_2.12:3.5.0')\
       .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
       .config('spark.sql.catalog.local','org.apache.iceberg.spark.SparkCatalog') \
       .config('spark.sql.catalog.local.type','hadoop') \
       .config('spark.sql.catalog.local.warehouse',DW_PATH) \
       .getOrCreate()
   
   t1 = spark.range(10000).withColumn("year", lit(2023))
   t1 = t1.withColumn("business_vertical",
                   array(lit("Retail"), lit("SME"), lit("Cor"), lit("Analytics"))\
                           .getItem((rand()*4).cast("int")))

   t1.coalesce(1).writeTo('local.db.pyic_mor_test').partitionedBy('year').using('iceberg')\
       .tableProperty('format-version','2')\
       .tableProperty('write.delete.mode','merge-on-read')\
       .tableProperty('write.update.mode','merge-on-read')\
       .tableProperty('write.merge.mode','merge-on-read')\
       .create()
   ```
   
   2. Check the table properties to make sure the table is MOR
   ```python
   spark.sql("SHOW TBLPROPERTIES local.db.pyic_mor_test").show(truncate=False)
   ```
   
   ```sql
   +-------------------------------+-------------------+
   |key                            |value              |
   +-------------------------------+-------------------+
   |current-snapshot-id            |2543645387796664537|
   |format                         |iceberg/parquet    |
   |format-version                 |2                  |
   |write.delete.mode              |merge-on-read      |
   |write.merge.mode               |merge-on-read      |
   |write.parquet.compression-codec|zstd               |
   |write.update.mode              |merge-on-read      |
   +-------------------------------+-------------------+
   ```
   
   3. Run an `UPDATE` statement to generate a delete file
   
   ```python
   spark.sql(f"UPDATE local.db.pyic_mor_test SET business_vertical = 'DataEngineering' WHERE id > 7000")
   ```
   
   4. Check that a delete file is generated
   
   ```python
   spark.table(f"local.db.pyic_mor_test.delete_files").show()
   ```
   ```sql
   +-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   |content|           file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|        column_sizes|value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
   +-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   |      1|/Users/akashdeepg...|    PARQUET|      0|   {2023}|        2999|              4878|{2147483546 -> 21...|        NULL|             NULL|            NULL|{2147483546 -> [2...|{2147483546 -> [2...|        NULL|          [4]|        NULL|         NULL|{{NULL, NULL, NUL...|
   +-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   ```
   
   ### Reading the Spark-created table from PyIceberg
   
   ```python
   from pyiceberg.table import StaticTable
   
   # Using latest metadata.json path
   metadata_path = "./warehouse/db/pyic_mor_test/metadata/v2.metadata.json"
   
   table = StaticTable.from_metadata(metadata_path)
   
   # This will break with the stacktrace provided above
   table.inspect.entries()
   ```
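   As a possible workaround sketch (untested; it assumes `table.metadata.snapshots` lists snapshots in commit order), `entries()` can still be called for a snapshot that contains no delete files, e.g. the one produced by the initial write:

   ```python
   # Hypothetical workaround: inspect the first snapshot, which only contains
   # data-file entries from the initial CTAS and therefore does not hit the bug.
   first_snapshot = table.metadata.snapshots[0]
   print(table.inspect.entries(snapshot_id=first_snapshot.snapshot_id))
   ```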
   
   ## Issue found after debugging
   
   I did some debugging and found that `inspect.entries()` breaks for MOR tables while reading the `*-delete.parquet` files present in the table.

   While reading the delete file entry, `value_counts` comes back as null. I can see that the `ManifestEntryStatus` is `ADDED` and the `DataFile` content is `DataFileContent.POSITION_DELETES`, which seems correct.
   I looked further into the `manifest.avro` file that holds the entries for the delete Parquet files, and `value_counts` is `NULL` there as well. That is why `entry.data_file.value_counts` comes back as null.

   `value_counts` being `null` can also be seen above in the output of the `delete_files` metadata table query.
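   A minimal sketch of a possible fix (not the official change; `_as_dict` is a hypothetical helper): since the metric maps on a delete file's `DataFile` can be absent, the `dict(...)` conversions in `InspectTable.entries()` could guard against `None`, e.g.:

   ```python
   from typing import Any, Dict, Optional

   def _as_dict(metric_map: Optional[Dict[int, Any]]) -> Dict[int, Any]:
       """Treat a missing metric map (None) as an empty dict before conversion."""
       return dict(metric_map) if metric_map is not None else {}

   # The affected keys in InspectTable.entries() would then read, for example:
   #   "value_counts": _as_dict(entry.data_file.value_counts),
   #   "null_value_counts": _as_dict(entry.data_file.null_value_counts),
   #   "nan_value_counts": _as_dict(entry.data_file.nan_value_counts),
   ```

   Whether these fields should instead stay `None` in the resulting Arrow table depends on the nullability of the corresponding `entries_schema` fields, so the exact fix is up to the maintainers.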
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time

