pvary commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1768291556

   > * It seems that ORC is not experiencing this issue because it creates 
value readers based on the Iceberg column types.
   > * Avro reads the fields entirely based on the file type, which seems to be 
problematic. However, this causes fewer problems for Avro than for Parquet, 
because Avro natively supports STRING and BYTES types, whereas Parquet only has 
the BINARY physical type (whether the field is a string is determined by 
additional annotations or external metadata).
   
   Thanks for the explanation!
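   For what it's worth, the Parquet ambiguity described above can be sketched in a few lines. All names and schemas below are invented for illustration; this is not Iceberg's actual reader code:

```python
# Hypothetical sketch: a Parquet file stores strings and raw bytes with the
# same physical type, BINARY; only a logical-type annotation (or the table
# metadata) says which one a column really is.

# File schema as written: physical type plus an optional logical annotation.
file_schema = {
    "name": {"physical": "BINARY", "logical": "STRING"},
    "tag": {"physical": "BINARY", "logical": None},  # annotation missing
}

# Iceberg column types for the same table.
iceberg_schema = {"name": "string", "tag": "string"}

def reader_from_file_type(column: str) -> str:
    """Pick the value reader from the file schema alone (the problematic path)."""
    spec = file_schema[column]
    return "string-reader" if spec["logical"] == "STRING" else "binary-reader"

def reader_from_iceberg_type(column: str) -> str:
    """Pick the value reader from the Iceberg column type (ORC-style; the fix)."""
    return "string-reader" if iceberg_schema[column] == "string" else "binary-reader"

# For "tag" the two strategies diverge: the file-driven reader falls back to
# raw bytes, while the Iceberg-driven reader produces a string.
print(reader_from_file_type("tag"))     # binary-reader
print(reader_from_iceberg_type("tag"))  # string-reader
```

   The divergence on `tag` is exactly the failure mode: a column that the table declares as a string, but whose file-level annotation is missing, gets read as raw bytes when the reader is driven by the file schema.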
   
   > * The data type read should be consistent with the Iceberg column type, so 
I think Spark should also incorporate this modification.
   
   How hard would it be to incorporate this into the Spark reader as well?
   I am uncomfortable with this kind of fix when it is applied to only one of 
the engines.
   If it is not too complicated, we should add it here; if it is, then we need 
to create a separate PR.
   
   > * Additionally, Iceberg has a UUID type, which seems to be supported in 
Spark but not in Flink: [Spark 3.3: Add read and write support for UUIDs 
#7496](https://github.com/apache/iceberg/pull/7496)
   
   I think this is a harder nut to crack. It is probably worth a separate PR to 
fix this in Flink.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

