rdblue commented on code in PR #13445:
URL: https://github.com/apache/iceberg/pull/13445#discussion_r2289533392
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java:
##########
@@ -79,6 +84,123 @@ public static <T> ParquetValueWriter<T> buildWriter(StructType dfSchema, Message
         ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new WriteBuilder(type));
   }
+  @SuppressWarnings("unchecked")
+  public static <T> ParquetValueWriter<T> buildWriter(Schema iSchema, MessageType type) {
Review Comment:
I think there's a problem with this method: we need to know the actual Spark type of each field in the row in order to call the correct getter on `InternalRow`.
When Spark is writing a `short`, Iceberg converts the type to an `int`. That works fine most of the time, but some row implementations (I think `UnsafeRow`) will fail when you call `getInt` instead of `getShort`. That's why we were preserving the Spark type here and passing it into `ints`. By changing this to iterate over the Iceberg type and convert, we lose the information that the Spark representation is a `short` and that it needs to be fetched that way. That happens on line 288.
This is also a problem that @pvary has been hitting. I'll take a closer look
and see if we should solve this some other way to unblock both.
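To illustrate the mismatch, here is a minimal, hypothetical sketch (a stand-in row class, not Spark's real `InternalRow`/`UnsafeRow`): a field declared as a `short` is only safe to read through `getShort`, and reading the same ordinal with `getInt` pulls in the neighboring field's bytes, which is the kind of failure a writer built from the converted Iceberg `int` type would hit.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a typed row (NOT Spark's InternalRow/UnsafeRow):
// fixed-width fields are packed back to back, so the getter must match the
// declared Spark type of the field to read the right bytes.
public class TypedGetterSketch {
  static final class PackedShortRow {
    private final ByteBuffer buf = ByteBuffer.allocate(8); // big-endian by default

    void setShort(int ordinal, short v) {
      buf.putShort(ordinal * 2, v); // each short field occupies 2 bytes
    }

    short getShort(int ordinal) {
      return buf.getShort(ordinal * 2); // reads exactly the field's 2 bytes
    }

    int getInt(int ordinal) {
      // Reading 4 bytes at a 2-byte field's offset also consumes the next
      // field's bytes -- the value is wrong even though the call "works".
      return buf.getInt(ordinal * 2);
    }
  }

  public static void main(String[] args) {
    PackedShortRow row = new PackedShortRow();
    row.setShort(0, (short) -1); // bytes FF FF
    row.setShort(1, (short) 7);  // bytes 00 07

    System.out.println(row.getShort(0)); // -1: getter matches the Spark type
    System.out.println(row.getInt(0));   // -65529 (0xFFFF0007): wrong value
  }
}
```

The packing scheme here is invented for illustration; the point from the comment is the same, though: once the writer only sees the converted Iceberg `int`, it no longer knows the field must be fetched with the `short` getter.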
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]