Fokko commented on code in PR #13445:
URL: https://github.com/apache/iceberg/pull/13445#discussion_r2276781094
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java:
##########
@@ -63,8 +64,9 @@ private SparkParquetWriters() {}
   @SuppressWarnings("unchecked")
   public static <T> ParquetValueWriter<T> buildWriter(StructType dfSchema, MessageType type) {
+    StructType writeSchema = PruneNullType.prune(dfSchema);
     return (ParquetValueWriter<T>)
-        ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new WriteBuilder(type));
+        ParquetWithSparkSchemaVisitor.visit(writeSchema, type, new WriteBuilder(type));
Review Comment:
> That issue is a direct consequence of not representing unknown fields in the Parquet type. Maybe we should rethink that decision and filter the Parquet schema later, like when creating a Parquet file. For now, we can probably work around the issue by updating the visitor logic to iterate over the fields and account for missing `NullType`.
I explored that option as well, but I think we're already too far down that path. The tests started failing because the writer starts allocating Arrow buffers and attempting metrics collection.
I understand the potential issue, let me try to reproduce it 👍
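For context, here is a minimal, self-contained sketch of the pruning idea behind `PruneNullType.prune` (this is not Iceberg's actual implementation; `Field`, `NULL_TYPE`, and the `List`-based struct encoding are illustrative stand-ins for Spark's `StructField`/`DataType`): fields whose type is the null/unknown type are dropped, recursively, so the write schema only keeps fields the Parquet schema can represent.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PruneNullSketch {

  // Stand-in for Spark's StructField: a name plus a type, where the type is
  // either a leaf type name (e.g. "int") or a nested List<Field> struct.
  record Field(String name, Object type) {}

  // Stand-in for Spark's NullType, i.e. a field with no Parquet representation.
  static final String NULL_TYPE = "null";

  // Recursively drop fields typed as NULL_TYPE and descend into nested structs.
  @SuppressWarnings("unchecked")
  static List<Field> prune(List<Field> struct) {
    return struct.stream()
        .filter(f -> !NULL_TYPE.equals(f.type()))
        .map(f -> f.type() instanceof List<?> nested
            ? new Field(f.name(), prune((List<Field>) nested))
            : f)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Field> schema = List.of(
        new Field("id", "int"),
        new Field("unknown_col", NULL_TYPE),  // no Parquet representation
        new Field("nested", List.of(
            new Field("a", "string"),
            new Field("b", NULL_TYPE))));     // nested unknown field
    // Prints the surviving top-level field names: [id, nested]
    System.out.println(prune(schema).stream()
        .map(Field::name)
        .collect(Collectors.toList()));
  }
}
```

Pruning before the visitor runs keeps the Spark-side schema aligned field-by-field with the Parquet `MessageType`, which is why the visitor no longer trips over fields it cannot match.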
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]