binayakd opened a new issue, #1353: URL: https://github.com/apache/iceberg-python/issues/1353
### Apache Iceberg version

0.8.0 (latest release)

### Please describe the bug 🐞

Using the NYC taxi data set found [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet), if I follow the standard way of creating a catalog and table, but call `add_files` instead of `append`:

```python
from pyiceberg.catalog.sql import SqlCatalog
import pyarrow.parquet as pq

warehouse_path = "/tmp/warehouse"
data_file_path = "/tmp/test-data"

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

df = pq.read_table(f"{data_file_path}/yellow_tripdata_2024-01.parquet")

catalog.create_namespace("default")
table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)
table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
```

I get a `KeyError`:

```
Traceback (most recent call last):
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
    main()
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
    table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
    tx.add_files(
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
    for data_file in data_files:
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
    yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
    statistics = data_file_statistics_from_parquet_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
    del col_aggs[field_id]
        ~~~~~~~~^^^^^^^^^^
KeyError: 1
```

This happens because the parquet file does not have column-level statistics set, so the code falls into the `else` block [here](https://github.com/apache/iceberg-python/blob/12e87a4fb6cc7891a80fd18c9367bffd78255271/pyiceberg/io/pyarrow.py#L2394). As a result, `col_aggs` and `null_value_counts` are never populated, but `invalidate_col` is updated. So when the `del` statement [here](https://github.com/apache/iceberg-python/blob/12e87a4fb6cc7891a80fd18c9367bffd78255271/pyiceberg/io/pyarrow.py#L2400) runs, the `KeyError` is thrown.

As discussed on [slack](https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1732130926917969?thread_ts=1732091308.403089&cid=C029EE6HQ5D), @kevinjqliu proposed switching `del col_aggs[field_id]` to `col_aggs.pop(field_id, None)`. I will be raising a PR soon.