nameexhaustion opened a new issue, #2246:
URL: https://github.com/apache/iceberg-python/issues/2246

   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   ### Reproducible example
   
   ```python
   # Reproducible example: append an Arrow table with deeply nested
   # list<struct<list<...>>> columns to an Iceberg table, then scan it back.
   # Per the issue report, outer NULL lists round-trip as empty lists ([]),
   # so the scanned table no longer equals the appended one.
   # NOTE(review): reconstructed from a line-wrapped email paste — the wrapped
   # output-table comments below had lost their leading "#", which made the
   # pasted script a SyntaxError; prefixes restored and statements rejoined.
   from __future__ import annotations

   import itertools
   from datetime import datetime
   from functools import partial

   import pyarrow as pa
   from pyiceberg.catalog.sql import SqlCatalog
   from pyiceberg.schema import Schema as IcebergSchema
   from pyiceberg.types import (
       IntegerType,
       ListType,
       NestedField,
       StringType,
       StructType,
       TimestampType,
   )

   import polars as pl

   tmp_path = ".env/_tmp"

   # In-memory SQL catalog backed by a local file warehouse.
   catalog = SqlCatalog(
       "default",
       uri="sqlite:///:memory:",
       warehouse=f"file://{tmp_path}",
   )
   catalog.create_namespace("namespace")

   # Monotonically increasing field IDs for the Iceberg schema below.
   next_field_id = partial(next, itertools.count())

   # Schema: column_1: list<struct<field_1: list<struct<key: list<timestamp>,
   # value: list<int>>>, field_2: int, field_3: string>>
   catalog.create_table(
       "namespace.table",
       IcebergSchema(
           NestedField(
               field_id=next_field_id(),
               name="column_1",
               field_type=ListType(
                   element_id=next_field_id(),
                   element=StructType(
                       NestedField(
                           field_id=next_field_id(),
                           name="field_1",
                           field_type=ListType(
                               element_id=next_field_id(),
                               element=StructType(
                                   NestedField(field_id=next_field_id(), name="key", field_type=ListType(
                                       element_id=next_field_id(),
                                       element=TimestampType(),
                                       element_required=False,
                                   ), required=True),
                                   NestedField(field_id=next_field_id(), name="value", field_type=ListType(
                                       element_id=next_field_id(),
                                       element=IntegerType(),
                                       element_required=False,
                                   ), required=False),
                               ),
                               element_required=False
                           ),
                           required=False,
                       ),
                       NestedField(field_id=next_field_id(), name="field_2", field_type=IntegerType(), required=False),
                       NestedField(field_id=next_field_id(), name="field_3", field_type=StringType(), required=False),
                   ),
                   element_required=False,
               ),
               required=False,
           ),
       ),
   )  # fmt: skip

   tbl = catalog.load_table("namespace.table")

   # Five rows exercising progressively "emptier" nesting: fully populated,
   # inner NULL list, all-NULL struct fields, NULL struct element, NULL row.
   df_dict = {
       "column_1": [
           [
               {
                   "field_1": [
                       {"key": [datetime(2025, 1, 1), None], "value": [1, 2, None]}
                   ],
                   "field_2": 7,
                   "field_3": "F3",
               }
           ],
           [
               {
                   "field_1": [{"key": [datetime(2025, 1, 1), None], "value": None}],
                   "field_2": 7,
                   "field_3": "F3",
               }
           ],
           [{"field_1": None, "field_2": None, "field_3": None}],
           [None],
           None,
       ],
   }

   # Arrow schema mirroring the Iceberg schema ("key" is non-nullable,
   # matching required=True above).
   arrow_tbl = pa.Table.from_pydict(
       df_dict,
       schema=pa.schema(
           [
               (
                   "column_1",
                   pa.large_list(
                       pa.struct(
                           [
                               (
                                   "field_1",
                                   pa.large_list(
                                       pa.struct(
                                           [
                                               pa.field(
                                                   "key",
                                                   pa.large_list(pa.timestamp("us")),
                                                   nullable=False,
                                               ),
                                               ("value", pa.large_list(pa.int32())),
                                           ]
                                       )
                                   ),
                               ),
                               ("field_2", pa.int32()),
                               ("field_3", pa.string()),
                           ]
                       )
                   ),
               )
           ]
       ),
   )

   tbl.append(arrow_tbl)

   arrow_tbl_from_scan = tbl.scan().to_arrow()

   # Sanity check: the original table equals itself.
   assert arrow_tbl.equals(arrow_tbl)  # OK
   # Fails:
   # assert arrow_tbl.equals(arrow_tbl_from_scan)

   # Side-by-side comparison of expected vs. round-tripped values per row.
   print(
       pl.concat(
           [
               pl.DataFrame(arrow_tbl).rename({"column_1": "column_1_expected"}),
               pl.DataFrame(arrow_tbl_from_scan).rename({"column_1": "column_1_actual"}),
           ],
           how="horizontal",
       ).with_columns(
           matching=pl.col("column_1_expected").eq_missing(pl.col("column_1_actual"))
       )
   )
   # shape: (5, 3)
   # ┌─────────────────────────────────┬─────────────────────────────────┬──────────┐
   # │ column_1_expected               ┆ column_1_actual                 ┆ matching │
   # │ ---                             ┆ ---                             ┆ ---      │
   # │ list[struct[3]]                 ┆ list[struct[3]]                 ┆ bool     │
   # ╞═════════════════════════════════╪═════════════════════════════════╪══════════╡
   # │ [{[{[2025-01-01 00:00:00, null… ┆ [{[{[2025-01-01 00:00:00, null… ┆ true     │
   # │ [{[{[2025-01-01 00:00:00, null… ┆ [{[{[2025-01-01 00:00:00, null… ┆ true     │
   # │ [{null,null,null}]              ┆ [{[],null,null}]                ┆ false    │
   # │ [null]                          ┆ [null]                          ┆ true     │
   # │ null                            ┆ []                              ┆ false    │
   # └─────────────────────────────────┴─────────────────────────────────┴──────────┘
   
   ```
   
   ### Issue description
   
   Outer NULLs become empty lists inside the Parquet file written by PyIceberg.
   
   Note that the reproducible example uses `tbl.append(<arrow table>)`, but 
this will affect `pl.DataFrame.write_iceberg(..)` in the same way.
   
   
   ### Expected behavior
   
   `column_1_actual == column_1_expected` for all rows in the example.
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to