stevenzwu commented on code in PR #13445:
URL: https://github.com/apache/iceberg/pull/13445#discussion_r2283088004


##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java:
##########
@@ -79,6 +84,123 @@ public static <T> ParquetValueWriter<T> 
buildWriter(StructType dfSchema, Message
         ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new 
WriteBuilder(type));
   }
 
+  @SuppressWarnings("unchecked")
+  public static <T> ParquetValueWriter<T> buildWriter(Schema iSchema, 
MessageType type) {

Review Comment:
   > The big downside is that we now also compute the Spark types in this 
visitor, which might be expensive.
   
   Conceptually, the existing `WriteBuilder` also needs to traverse the schema. 
so may not be much of a difference. Maybe we can run the jmh benchmark code so 
see the difference. but the benchmark currently runs for very small schema (a 
few to a dozen of fields). we may need to cover the larger schema scenario too.
   
   The other downside is the duplicates/similarities btw `WriteBuilder` and 
`IcebergWriteBuilder`. Maybe those two callers from the 
`SparkFileWriterFactory` can also switch to the Iceberg schema? Then we can 
remove/deprecate the other `buildWriter` method using Spark type.



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/data/AvroDataTestBase.java:
##########
@@ -109,8 +110,8 @@ protected boolean supportsRowLineage() {
           required(114, "dec_9_0", Types.DecimalType.of(9, 0)), // int encoded
           required(115, "dec_11_2", Types.DecimalType.of(11, 2)), // long 
encoded
           required(116, "dec_20_5", Types.DecimalType.of(20, 5)), // requires 
padding
-          required(117, "dec_38_10", Types.DecimalType.of(38, 10)) // Spark's 
maximum precision
-          );
+          required(117, "dec_38_10", Types.DecimalType.of(38, 10)), // Spark's 
maximum precision
+          optional(118, "unk", Types.UnknownType.get()));

Review Comment:
   can we add the unknown type to the middle of the schema to cover the case 
that Ryan mentioned in one comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to