JonasJ-ap opened a new pull request, #7523: URL: https://github.com/apache/iceberg/pull/7523
This PR follows #6997.

In the previous PR, we added support for inferring an Iceberg schema from Parquet files. However, the implemented visitor does not correctly differentiate between `UUIDType` and `FixedType(16)`, since both are stored as `fixed_len_byte_array[16]` according to the [spec](https://iceberg.apache.org/spec/#primitive-types). This PR fixes the issue by passing an optional `expected_schema` to the `pyarrow_to_iceberg` visitor so that the visitor can check whether `UUIDType` is preferred during conversion.

## Example

For a table with a schema like:

```
Current schema Schema, id=0
├── 1: uuid: required uuid (c1)
├── 2: c1: required string (c1)
├── 3: struct1: required struct<5: c2: required uuid (c2), 6: c3: required string (c3)>
└── 4: list: required list<struct<8: c4: required uuid (c4), 9: c5: required uuid (c5)>>
```

and optimized by AWS Athena's query:

```
OPTIMIZE uuid_test REWRITE DATA USING BIN_PACK;
```

Running the following query on the current master branch raises the following exception:

```bash
>>> table = catalog.load_table("iceberg_ref.uuid_test")
>>> df = table.scan().to_pandas()
Error loading table uuid_test to pandas: Cannot promote fixed[16] to uuid
```

After this PR's fix, we can successfully read the table:

```bash
============Loading table iceberg_ref.uuid_test to pandas ===========
   uuid  c1  struct1  list
0  b'\xa8_\x9a\x0e\x14cM\x84\xb0\xe5\x83<!*W\xc4'  text2  {'c2': b'\xcb\xb6\xee\x04\x94\x10E\xfd\x97\x16...  [{'c4': b'\x07\x84\x05\xabv\xe5C\x1c\x86B\xe5F...
1  b'\xf2\x9b\xe7M\xd3\xe5G&\xae\xbf\x90Q\x85!~\x82'  text2  {'c2': b"\xce\x87\xe7_''@\xd1\xa0{z\x1d\n\x1e\...  [{'c4': b'\xc8\xc6\xeaF\xb9\xadL,\x856\x05Uj\x...
2  b'\xd1k^\xd0r\xe8J\xa4\xb4[\x16\x90\x88H\xb21'  text1  {'c2': b'\x0b1\xd1\x07\x89\x90C\xa9\xa7\xc1\xf...  [{'c4': b'\xb0\x1b\xce\x07\xa0\xe6DE\xae\x88O\...
3  b'P\xd6`\x10\xc1\xecO\x17\x991\x19N\xbcI-4'  text1  {'c2': b'rC\x16\xf2\xbbYB\x93\xa1\x04\xa0\x99g...  [{'c4': b'K\xfb\xfa\xa1{\x9fB\x0e\xaeu\xe2\xbb...
4  b'\x19R\xb9g"\xafH\x02\xbb\x1d\x14\xe9\xd5\xe9...  Sample Text  {'c2': b'\x8f\xf0z\xe0\x8b1N\x1a\xa7f\xb0BD\x8...  [{'c4': b"^-\xdd\x87\x0ffD\x8a\x88'I\xad\x87\x...
```
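
To illustrate why the physical type alone is ambiguous and how an expected type can break the tie, here is a minimal, self-contained sketch. It does not use the PyIceberg API: the `FixedType` and `UUIDType` classes and the `resolve_fixed` helper below are hypothetical stand-ins defined only for this example, and the actual visitor in the PR may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical stand-ins for the Iceberg types named in the PR description.
@dataclass(frozen=True)
class FixedType:
    length: int


@dataclass(frozen=True)
class UUIDType:
    pass


IcebergLikeType = Union[FixedType, UUIDType]


def resolve_fixed(length: int, expected: Optional[IcebergLikeType] = None) -> IcebergLikeType:
    """Map a fixed_len_byte_array[length] column to an Iceberg-like type.

    Hypothetical helper: without a hint, a 16-byte fixed column can only be
    read as FixedType(16). If the expected type for that field is a UUID,
    prefer UUIDType instead.
    """
    if length == 16 and isinstance(expected, UUIDType):
        return UUIDType()
    return FixedType(length)


# No hint: the column is inferred as fixed[16], which cannot be promoted to uuid.
without_hint = resolve_fixed(16)

# With the table schema's field type as a hint, the same column resolves to uuid.
with_hint = resolve_fixed(16, expected=UUIDType())
```

The hint only applies to 16-byte columns in this sketch, so a fixed column of any other length is left as `FixedType` even when a UUID is expected.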
