Re: [PR] Spark: Read/Write `UnknownType` [iceberg]

via GitHub Thu, 14 Aug 2025 01:38:26 -0700


Fokko commented on code in PR #13445:
URL: https://github.com/apache/iceberg/pull/13445#discussion_r2275913790



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java:
##########
@@ -63,8 +64,9 @@ private SparkParquetWriters() {}
 
   @SuppressWarnings("unchecked")
   public static <T> ParquetValueWriter<T> buildWriter(StructType dfSchema, MessageType type) {
+    StructType writeSchema = PruneNullType.prune(dfSchema);
     return (ParquetValueWriter<T>)
-        ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new WriteBuilder(type));
+        ParquetWithSparkSchemaVisitor.visit(writeSchema, type, new WriteBuilder(type));

Review Comment:
   > I think this issue raises a question about how this PR implements unknown 
handling by omitting a writer. To avoid the alignment issue with the fields in 
the actual InternalRow instances coming in, this needs to either skip the 
position or have a NoopWriter. I think the noop option is easiest, but I think 
the reason for pruning the StructType was probably to avoid failing in the 
visitor because the Spark type doesn't have the same number of fields as the 
Parquet type where unknowns have been removed 
([here](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/ParquetWithSparkSchemaVisitor.java#L186-L187)).
   
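   That's exactly the issue. Without the pruning, the visitor fails because the Spark struct still carries the `NullType` field while the Parquet group no longer has a matching column: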
   
   ```
   Structs do not match: 
   
   StructType(
       StructField(int,IntegerType,false),
       StructField(unk,NullType,true)
   )
   
   and
   
   optional group nested = 2 {
     required int32 int = 3;
   }
   
        at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:445)
        at org.apache.iceberg.spark.data.ParquetWithSparkSchemaVisitor.visitFields(ParquetWithSparkSchemaVisitor.java:177)
        at org.apache.iceberg.spark.data.ParquetWithSparkSchemaVisitor.visit(ParquetWithSparkSchemaVisitor.java:160)
        at org.apache.iceberg.spark.data.ParquetWithSparkSchemaVisitor.visitField(ParquetWithSparkSchemaVisitor.java:168)
        at org.apache.iceberg.spark.data.ParquetWithSparkSchemaVisitor.visitFields(ParquetWithSparkSchemaVisitor.java:188)
        at org.apache.iceberg.spark.data.ParquetWithSparkSchemaVisitor.visit(ParquetWithSparkSchemaVisitor.java:54)
        at org.apache.iceberg.spark.data.SparkParquetWriters.buildWriter(SparkParquetWriters.java:67)
        at org.apache.iceberg.spark.source.SparkFileWriterFactory.lambda$configureDataWrite$3(SparkFileWriterFactory.java:118)
   ```
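
For illustration only, here is a minimal sketch of the noop-writer option from the quoted comment. It is plain Java and does not use Iceberg's or Spark's classes: `FieldWriter`, `column`, `NOOP`, and `writeRow` are hypothetical names, and the trailing `str` field is an assumption added so there is a field after the unknown one. The idea is that each writer is matched to the incoming row by position, so the unknown-typed field keeps a writer that consumes its slot and emits nothing.

```java
import java.util.ArrayList;
import java.util.List;

public class NoopWriterSketch {
  /** Hypothetical per-field writer; this is not Iceberg's ParquetValueWriter. */
  interface FieldWriter {
    void write(Object value, List<String> out);
  }

  /** Writer for a field that has a column in the file schema. */
  static FieldWriter column(String name) {
    return (value, out) -> out.add(name + "=" + value);
  }

  /** Occupies the row position of an unknown-typed field but writes nothing. */
  static final FieldWriter NOOP = (value, out) -> {};

  /** Writes a row by position: writer i always receives row value i. */
  static List<String> writeRow(Object[] row, FieldWriter[] writers) {
    if (row.length != writers.length) {
      throw new IllegalArgumentException(
          "Row has " + row.length + " fields but there are " + writers.length + " writers");
    }
    List<String> out = new ArrayList<>();
    for (int pos = 0; pos < row.length; pos++) {
      writers[pos].write(row[pos], out);
    }
    return out;
  }

  public static void main(String[] args) {
    // Incoming row layout: (int, unk, str); "unk" has no column in the file.
    Object[] row = {1, null, "a"};
    FieldWriter[] writers = {column("int"), NOOP, column("str")};
    System.out.println(writeRow(row, writers)); // prints [int=1, str=a]
  }
}
```

If the writer for `unk` were dropped instead of replaced, the writer for `str` would sit at position 1 and read the unknown field's slot. That is the alignment problem the quoted comment describes.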



