binayakd opened a new issue, #1353:
URL: https://github.com/apache/iceberg-python/issues/1353

   ### Apache Iceberg version
   
   0.8.0 (latest release)
   
   ### Please describe the bug 🐞
   
   Using the NYC taxi dataset found [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet), if I follow the standard way of creating the catalog and table, but call `add_files` instead of `append`:
   
   ```python
   from pyiceberg.catalog.sql import SqlCatalog
   import pyarrow.parquet as pq
   
   
   warehouse_path = "/tmp/warehouse"
   data_file_path = "/tmp/test-data" 
   
   catalog = SqlCatalog(
       "default",
       **{
           "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
           "warehouse": f"file://{warehouse_path}",
       }
   )
   
   df = pq.read_table(f"{data_file_path}/yellow_tripdata_2024-01.parquet")
   
   catalog.create_namespace("default")
   
   table = catalog.create_table(
       "default.taxi_dataset",
       schema=df.schema,
   )
   
   table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
   ```
   I get a `KeyError`:
   
   ```
   Traceback (most recent call last):
     File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
       main()
     File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
       table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
     File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
       tx.add_files(
     File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
       for data_file in data_files:
     File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
       yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
     File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
       statistics = data_file_statistics_from_parquet_metadata(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
       del col_aggs[field_id]
           ~~~~~~~~^^^^^^^^^^
   KeyError: 1
   ```
   
   This is because this parquet file does not have column-level statistics, so the code falls into the `else` block [here](https://github.com/apache/iceberg-python/blob/12e87a4fb6cc7891a80fd18c9367bffd78255271/pyiceberg/io/pyarrow.py#L2394). As a result, `col_aggs` and `null_value_counts` are not updated, but `invalidate_col` is. When the `del` statement is then executed [here](https://github.com/apache/iceberg-python/blob/12e87a4fb6cc7891a80fd18c9367bffd78255271/pyiceberg/io/pyarrow.py#L2400), the `KeyError` is thrown.
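
   To confirm that the file lacks usable column-level statistics, the column-chunk metadata can be inspected directly with pyarrow (a quick check via `is_stats_set` / `has_min_max`, assuming the same `data_file_path` as in the script above):

   ```python
   import pyarrow.parquet as pq

   data_file_path = "/tmp/test-data"

   # Inspect the column-chunk metadata of the first row group;
   # is_stats_set is False (and statistics is None) when the writer
   # did not record column-level statistics.
   metadata = pq.ParquetFile(f"{data_file_path}/yellow_tripdata_2024-01.parquet").metadata
   row_group = metadata.row_group(0)
   for i in range(row_group.num_columns):
       column = row_group.column(i)
       print(
           column.path_in_schema,
           column.is_stats_set,
           column.statistics.has_min_max if column.is_stats_set else None,
       )
   ```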
   
   As discussed on [slack](https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1732130926917969?thread_ts=1732091308.403089&cid=C029EE6HQ5D), @kevinjqliu proposed replacing `del col_aggs[field_id]` with `col_aggs.pop(field_id, None)`, as sketched below.
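
   A minimal, self-contained sketch of the pattern (illustrative data only, not the actual pyiceberg source) showing why `del` fails and how `pop(..., None)` tolerates the missing key:

   ```python
   col_aggs = {}
   null_value_counts = {}
   invalidate_col = set()

   # Field 1 has no usable column statistics, so it only lands in invalidate_col
   # and never gets an entry in col_aggs / null_value_counts.
   for field_id, statistics in [(1, None)]:
       if statistics is not None:
           col_aggs[field_id] = statistics
           null_value_counts[field_id] = 0
       else:
           invalidate_col.add(field_id)

   for field_id in invalidate_col:
       # del col_aggs[field_id]        # raises KeyError: 1, as in the traceback above
       col_aggs.pop(field_id, None)    # proposed fix: no-op when the key is absent
       null_value_counts.pop(field_id, None)
   ```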
   
   I will be raising a PR soon. 
   

