fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1767764825
> @fengjiajie: Checked the codepath for the Spark readers and I have 2 questions:
>
> * What about ORC and Avro files? Don't we have the same issue there?
> * Would it worth to add the same fix for the Spark reader as well?

@pvary The issue is that the type of the read result should be determined by the column type defined in Iceberg, rather than by the data type stored in the Parquet file.

- ORC does not seem to be affected by this issue, because it creates its value readers based on the Iceberg column types.
- Avro reads the fields entirely based on the file type, which seems problematic in principle. However, it does not lead to the significant issues seen with Parquet, because Avro natively supports distinct STRING and BYTES types, whereas Parquet only has the BINARY physical type (whether a field is a string is determined by an additional annotation or by external metadata; see the sketch after this list).
- The Spark implementation is similar to the Flink one, so it is possible that both have the same issue.
- Additionally, Iceberg has a UUID type, which seems to be supported in Spark but not in Flink: https://github.com/apache/iceberg/pull/7496
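
To illustrate the Parquet side of this point, here is a minimal sketch using the plain parquet-mr schema builder API (not Iceberg's reader code, and the field names are made up): both columns are physically BINARY, and only the logical type annotation marks one of them as a UTF-8 string, so a reader that looks only at the physical type cannot tell string data and raw bytes apart.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class BinaryVsStringExample {
  public static void main(String[] args) {
    // Both fields share the same physical type: BINARY.
    // Only the logical type annotation says that "name" holds UTF-8 string data.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("name")
        .required(PrimitiveTypeName.BINARY)
            .named("payload")
        .named("example");

    // Prints something like:
    //   message example {
    //     required binary name (STRING);
    //     required binary payload;
    //   }
    System.out.println(schema);
  }
}
```

This is why the reader has to consult the Iceberg column type (or the annotation), not just the Parquet physical type, to decide whether to produce string or binary values.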