JonasJ-ap opened a new pull request, #7523:
URL: https://github.com/apache/iceberg/pull/7523

   This PR follows #6997 
   
   In the previous PR, we added support for inferring the Iceberg schema from 
Parquet files. However, the visitor implemented there cannot distinguish 
`UUIDType` from `FixedType(16)`, since both are stored as 
`fixed_len_byte_array[16]` according to the 
[spec](https://iceberg.apache.org/spec/#primitive-types). This PR fixes the 
issue by passing an optional `expected_schema` to the `pyarrow_to_iceberg` 
visitor so that the visitor can check whether `UUIDType` is expected during 
conversion.
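
   The idea can be illustrated with a minimal, self-contained sketch. It does 
not reproduce the actual visitor in this PR: `FixedType` and `UUIDType` below 
are stand-in dataclasses, and `resolve_fixed` together with its `expected` 
mapping (field id to expected type) are hypothetical names used only to show 
how an expected schema can disambiguate a 16-byte fixed column.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Stand-in types for illustration only; not the pyiceberg classes.
@dataclass(frozen=True)
class FixedType:
    length: int


@dataclass(frozen=True)
class UUIDType:
    pass


def resolve_fixed(field_id: int, length: int,
                  expected: Optional[Dict[int, object]] = None):
    """Pick a type for a fixed_len_byte_array[length] column.

    `expected` is a hypothetical mapping from field id to the type the
    table schema declares for that field.
    """
    inferred = FixedType(length)
    if expected is None:
        # No hint available: the file alone cannot tell uuid from fixed[16].
        return inferred
    wanted = expected.get(field_id)
    if length == 16 and isinstance(wanted, UUIDType):
        # The table schema says this field is a uuid, so prefer it.
        return wanted
    return inferred
```

   Without a hint, `resolve_fixed(1, 16)` falls back to `FixedType(16)`; with 
`expected={1: UUIDType()}` the same column resolves to `UUIDType()`, which 
avoids a later `fixed[16]` to `uuid` promotion.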
   
   ## Example
   For a table with a schema like:
   ```
   Current schema        Schema, id=0
                         ├── 1: uuid: required uuid (c1)
                         ├── 2: c1: required string (c1)
                         ├── 3: struct1: required struct<5: c2: required uuid (c2), 6: c3: required string (c3)>
                         └── 4: list: required list<struct<8: c4: required uuid (c4), 9: c5: required uuid (c5)>>
   ```
   and optimized with the AWS Athena query:
   ```
   OPTIMIZE uuid_test REWRITE DATA USING BIN_PACK;
   ```
   Running the following code on the current master branch raises an 
exception:
   ```bash
   >>> table = catalog.load_table("iceberg_ref.uuid_test")
   >>> df = table.scan().to_pandas()
   Error loading table uuid_test to pandas: Cannot promote fixed[16] to uuid
   ```
   With the fix in this PR, the table can be read successfully:
   ```bash
   ============Loading table iceberg_ref.uuid_test to pandas ===========
                                                   uuid           c1            
                                struct1                                         
      list
   0     b'\xa8_\x9a\x0e\x14cM\x84\xb0\xe5\x83<!*W\xc4'        text2  {'c2': 
b'\xcb\xb6\xee\x04\x94\x10E\xfd\x97\x16...  [{'c4': 
b'\x07\x84\x05\xabv\xe5C\x1c\x86B\xe5F...
   1  b'\xf2\x9b\xe7M\xd3\xe5G&\xae\xbf\x90Q\x85!~\x82'        text2  {'c2': 
b"\xce\x87\xe7_''@\xd1\xa0{z\x1d\n\x1e\...  [{'c4': 
b'\xc8\xc6\xeaF\xb9\xadL,\x856\x05Uj\x...
   2     b'\xd1k^\xd0r\xe8J\xa4\xb4[\x16\x90\x88H\xb21'        text1  {'c2': 
b'\x0b1\xd1\x07\x89\x90C\xa9\xa7\xc1\xf...  [{'c4': 
b'\xb0\x1b\xce\x07\xa0\xe6DE\xae\x88O\...
   3        b'P\xd6`\x10\xc1\xecO\x17\x991\x19N\xbcI-4'        text1  {'c2': 
b'rC\x16\xf2\xbbYB\x93\xa1\x04\xa0\x99g...  [{'c4': 
b'K\xfb\xfa\xa1{\x9fB\x0e\xaeu\xe2\xbb...
   4  b'\x19R\xb9g"\xafH\x02\xbb\x1d\x14\xe9\xd5\xe9...  Sample Text  {'c2': 
b'\x8f\xf0z\xe0\x8b1N\x1a\xa7f\xb0BD\x8...  [{'c4': 
b"^-\xdd\x87\x0ffD\x8a\x88'I\xad\x87\x...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

