lidavidm opened a new issue, #1766: URL: https://github.com/apache/iceberg-python/issues/1766
### Apache Iceberg version

main (development)

### Please describe the bug 🐞

**This happens on pyiceberg 0.9.0.**

https://github.com/apache/iceberg-python/blob/e3a5c3b9bd4af5da46c4b7159367a60568e63023/pyiceberg/io/pyarrow.py#L1441

This `set_column` call always tries to add a 1-row column. But this is wrong (and PyArrow rejects it): the added column needs to have the same length as the rest of the columns in the batch.

<details>
<summary>Reproducer</summary>

```python
import os
import tempfile

import pyarrow
import pyarrow.parquet

import pyiceberg.catalog
import pyiceberg.catalog.memory
import pyiceberg.io
import pyiceberg.io.pyarrow
import pyiceberg.partitioning
import pyiceberg.schema
import pyiceberg.table
import pyiceberg.transforms
import pyiceberg.typedef

schema = pyiceberg.schema.Schema(
    pyiceberg.schema.NestedField(
        field_id=1,
        name="o_orderkey",
        field_type=pyiceberg.schema.LongType(),
        required=False,
    ),
    pyiceberg.schema.NestedField(
        field_id=2,
        name="month",
        field_type=pyiceberg.schema.StringType(),
        required=False,
    ),
    schema_id=0,
    identifier_field_ids=[],
)
partition_spec = pyiceberg.partitioning.PartitionSpec(
    pyiceberg.partitioning.PartitionField(
        source_id=2,
        field_id=1000,
        transform=pyiceberg.transforms.IdentityTransform(),
        name="month",
    )
)

with tempfile.TemporaryDirectory() as tmp_path:
    print("Warehouse in", tmp_path)
    session = pyiceberg.catalog.memory.InMemoryCatalog(
        "session",
        **{pyiceberg.io.WAREHOUSE: tmp_path},
    )
    session.create_namespace("session")

    table = pyarrow.table({"o_orderkey": [1, 2, 3]})
    data_path = os.path.join(tmp_path, "orders.parquet")
    with open(data_path, "wb") as f:
        pyarrow.parquet.write_table(table, f)

    orders = session.create_table(
        identifier="session.orders",
        schema=schema,
        partition_spec=partition_spec,
    )

    # Work around lack of native support for doing this (I may have missed something)
    data_files = list(
        pyiceberg.io.pyarrow.parquet_files_to_data_files(
            orders.io, orders.metadata, [data_path]
        )
    )
    for data_file in data_files:
        data_file.partition = pyiceberg.typedef.Record(month="1992-02")
    with orders.transaction() as tx:
        if tx.table_metadata.name_mapping() is None:
            default_name_mapping = tx.table_metadata.schema().name_mapping.model_dump_json()
            tx.set_properties(
                **{
                    pyiceberg.table.TableProperties.DEFAULT_NAME_MAPPING: default_name_mapping,
                }
            )
        with tx.update_snapshot().fast_append() as update_snapshot:
            for data_file in data_files:
                update_snapshot.append_data_file(data_file)

    scan = orders.scan()
    print(scan.to_arrow())
```

</details>

<details>
<summary>Output</summary>

```
Warehouse in /tmp/tmpy69j5uf6
Traceback (most recent call last):
  File "/home/lidavidm/Code/repro.py", line 80, in <module>
    print(scan.to_arrow())
          ^^^^^^^^^^^^^^^
  File "/home/lidavidm/Code/venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1763, in to_arrow
    ).to_table(self.plan_files())
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lidavidm/Code/venv/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 1575, in to_table
    if table_result := future.result():
                       ^^^^^^^^^^^^^^^
  File "/home/lidavidm/miniforge3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lidavidm/miniforge3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/lidavidm/miniforge3/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lidavidm/Code/venv/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 1556, in _table_from_scan_task
    batches = list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lidavidm/Code/venv/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 1637, in _record_batches_from_scan_tasks_and_deletes
    for batch in batches:
                 ^^^^^^^
  File "/home/lidavidm/Code/venv/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 1441, in _task_to_record_batches
    result_batch = result_batch.set_column(index, name, [value])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 2969, in pyarrow.lib.RecordBatch.set_column
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Added column's length must match record batch's length. Expected length 3 but got length 1
```

</details>

<details>
<summary>venv</summary>

```
annotated-types==0.7.0
cachetools==5.5.2
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
fsspec==2025.2.0
greenlet==3.1.1
idna==3.10
markdown-it-py==3.0.0
mdurl==0.1.2
mmh3==5.1.0
pyarrow==19.0.1
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pyiceberg==0.9.0
pyparsing==3.2.1
python-dateutil==2.9.0.post0
requests==2.32.3
rich==13.9.4
six==1.17.0
sortedcontainers==2.4.0
SQLAlchemy==2.0.38
strictyaml==1.7.3
tenacity==9.0.0
typing_extensions==4.12.2
urllib3==2.3.0
```

</details>

### Willingness to contribute

- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org