[I] Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID [iceberg-python]

via GitHub Sun, 02 Jun 2024 08:53:44 -0700


syun64 opened a new issue, #788:
URL: https://github.com/apache/iceberg-python/issues/788


   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   While refactoring 
`project_table` (https://github.com/apache/iceberg-python/pull/786), I ran into 
test failures because the existing `project_table` function is not consistent 
about whether it returns the Parquet field ID in its PyArrow schema field 
metadata.
   
   There are cases where the Parquet field ID is attached to the field 
metadata, and cases where it isn't: 
https://github.com/apache/iceberg-python/blob/main/tests/io/test_pyarrow.py#L1062-L1080
   
   I think this is because we use `schema_to_pyarrow` as a fallback schema, 
and it attaches the Parquet field ID attribute to the field metadata: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1133
   
   I think we should correct this behavior so that it is consistent for all 
table scans.
   
   - Do we want to attach the Parquet field ID attribute to every PyArrow 
schema returned by `project_table`?
   - Or should we remove the Parquet field ID from the field metadata of the 
PyArrow schema? The idea here is that `schema_to_pyarrow` would have two modes, 
with or without the Parquet field ID (write versus read use 
cases).
   
   I think leaving out metadata that a given use case does not need will be 
cleaner for users. The Parquet field ID was added to `schema_to_pyarrow` so 
that we could persist the field ID into the Parquet files on write, but we do 
not want it when we are reading the table. Hence, I am leaning towards the 
second option. 
   
   Looking for some thoughts and direction on this issue so we can complete the 
refactoring to support `Iterator[RecordBatch]` output scans! @Fokko @HonahX 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
