fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1767764825
> @fengjiajie: Checked the codepath for the Spark readers and I have 2 questions:
>
> * What about ORC and Avro files? Don't we have the same issue there?
> * Would it worth to add the same fix for the Spark reader as well?

@pvary The issue is that the type of the read result should be determined by the column type defined in Iceberg, rather than by the data type stored in the Parquet file.

- ORC does not seem to be affected by this issue, because it creates its value readers based on the Iceberg column types.
- Avro reads the fields entirely based on the file type, which seems problematic in principle. However, it does not lead to the significant issues seen with Parquet, because Avro natively supports distinct STRING and BYTES types, whereas Parquet only has the BINARY physical type (whether a field is a string is determined by an additional annotation or by external metadata; see the sketch after this list).
- The Spark implementation is similar to the Flink one, so it is possible that both have the same issue.
- Additionally, Iceberg has a UUID type, which seems to be supported in Spark but not in Flink: https://github.com/apache/iceberg/pull/7496
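
To illustrate the Parquet side of this point, here is a minimal sketch using the plain parquet-mr schema builder API (not Iceberg's reader code, and the field names are made up): both columns are physically BINARY, and only the logical type annotation marks one of them as a UTF-8 string, so a reader that looks only at the physical type cannot tell string data and raw bytes apart.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class BinaryVsStringExample {
  public static void main(String[] args) {
    // Both fields share the same physical type: BINARY.
    // Only the logical type annotation says that "name" holds UTF-8 string data.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("name")
        .required(PrimitiveTypeName.BINARY)
            .named("payload")
        .named("example");

    // Prints something like:
    //   message example {
    //     required binary name (STRING);
    //     required binary payload;
    //   }
    System.out.println(schema);
  }
}
```

This is why the reader has to consult the Iceberg column type (or the annotation), not just the Parquet physical type, to decide whether to produce string or binary values.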