cgbur opened a new issue, #716:
URL: https://github.com/apache/iceberg-python/issues/716

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   When using the `add_files` table API, the Parquet metadata needs to be read, 
and a mapping of `Dict[str, int]` is used by 
[`data_file_statistics_from_parquet_metadata`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1670)
 to link the field ID to the name in the Parquet file for statistics 
collection. However, during [the mapping 
lookup](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1727)
 I received an error that a key was not present.
   
   My schema contains the following field (it is a subfield of a `Details` 
struct, which matters for the full path name later):
   ```
   extras: large_list<item: struct<key: string not null, value: string>> not null
     child 0, item: struct<key: string not null, value: string>
         child 0, key: string not null
         child 1, value: string
   ```
   
   Based on the Parquet schema path definition, these fields have the following paths:
   ```
   Details.extras.list.item.key
   Details.extras.list.item.value
   ```
   
   The issue is that the 
[`parquet_path_to_id_mapping`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1587)
 returns a mapping for these two fields as follows:
   ```
   Details.extras.list.element.key -> 189
   Details.extras.list.element.value -> 190
   ```
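
To make the mismatch concrete, here is a minimal plain-Python sketch of the lookup. It does not call pyiceberg; the dictionary simply copies the two entries shown above, and `lookup_field_ids` is a hypothetical helper written for this illustration, not a function from the library.

```python
# Mapping in the shape reported above; the IDs 189 and 190 come from this issue.
parquet_path_to_id = {
    "Details.extras.list.element.key": 189,
    "Details.extras.list.element.value": 190,
}

# Column paths as they appear in the Parquet file.
paths_in_file = [
    "Details.extras.list.item.key",
    "Details.extras.list.item.value",
]


def lookup_field_ids(paths, mapping):
    """Hypothetical helper: split paths into resolved IDs and missing keys."""
    resolved, missing = {}, []
    for path in paths:
        if path in mapping:
            resolved[path] = mapping[path]
        else:
            missing.append(path)
    return resolved, missing


resolved, missing = lookup_field_ids(paths_in_file, parquet_path_to_id)
print("resolved:", resolved)
print("missing:", missing)
```

Neither path from the file resolves, because the only difference between each pair of strings is the `item` versus `element` segment.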
   
   So, the issue appears to be that the visitor that constructs the schema 
paths uses `element` where the Parquet schema paths in my file use `item`. 
I am not sure yet how this comes about, as I have not dug 
into it too closely.
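
One possible workaround, sketched below purely as an assumption and not as how pyiceberg does or should handle this, is to normalize the list child name in a file path before the lookup. `normalize_list_element_name` is a hypothetical name introduced for this example.

```python
def normalize_list_element_name(path: str) -> str:
    """Rewrite an `item` segment that directly follows a `list` segment to `element`.

    Hypothetical helper for illustration only. It would also rewrite a struct
    field literally named `list` that has a child named `item`, so it is not
    safe as a general solution.
    """
    parts = path.split(".")
    for i in range(1, len(parts)):
        if parts[i - 1] == "list" and parts[i] == "item":
            parts[i] = "element"
    return ".".join(parts)


# Entries copied from the mapping reported in this issue.
parquet_path_to_id = {
    "Details.extras.list.element.key": 189,
    "Details.extras.list.element.value": 190,
}

file_path = "Details.extras.list.item.key"
field_id = parquet_path_to_id[normalize_list_element_name(file_path)]
print(field_id)
```

String rewriting like this only hides the symptom. A more robust approach would probably build the mapping from the names actually present in the file's schema, but I have not verified how the visitor derives them.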
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

