nandorKollar commented on code in PR #14499:
URL: https://github.com/apache/iceberg/pull/14499#discussion_r2526913213


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java:
##########
@@ -232,6 +233,81 @@ protected WriteSupport<InternalRow> getWriteSupport(Configuration configuration)
     }
   }
 
+  @Test
+  public void testTimestampMillisProducedBySparkIsReadCorrectly() throws IOException {
+    String outputFilePath =
+        String.format("%s/%s", temp.toAbsolutePath(), "parquet_timestamp_millis.parquet");
+    HadoopOutputFile outputFile =
+        HadoopOutputFile.fromPath(
+            new org.apache.hadoop.fs.Path(outputFilePath), new Configuration());
+
+    Schema schema = new Schema(required(1, "event_time", Types.TimestampType.withZone()));
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[] {
+              new StructField("event_time", DataTypes.TimestampType, false, Metadata.empty())
+            });
+
+    List<InternalRow> originalRows = Lists.newArrayList(RandomData.generateSpark(schema, 10, 0L));
+    List<InternalRow> rows = Lists.newArrayList();
+    for (InternalRow row : originalRows) {
+      long timestampMicros = row.getLong(0);
+      long timestampMillis = (timestampMicros / 1000) * 1000;
+      rows.add(
+          new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(
+              new Object[] {timestampMillis}));
+    }
+
+    try (ParquetWriter<InternalRow> writer =
+        new NativeSparkWriterBuilder(outputFile)
+            .set("org.apache.spark.sql.parquet.row.attributes", sparkSchema.json())
+            .set("spark.sql.parquet.writeLegacyFormat", "false")
+            .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
+            .set("spark.sql.parquet.fieldId.write.enabled", "true")
+            .build()) {
+      for (InternalRow row : rows) {
+        writer.write(row);
+      }
+    }

Review Comment:
   @shubham19may it would be better not to rely on the Spark module to cover this 
case with a test, but rather to keep the arrow module 'self-contained' and cover 
its functionality with tests there. 
[TestVariantReaders](https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantReaders.java#L1377)
 already creates a custom Parquet writer with the Avro object model, and it 
doesn't look too complicated. I'll try to put together an example with 
`ExampleParquetWriter`; it might be a bit more involved. Since Iceberg's 
Parquet writer doesn't write timestamp millis, unfortunately we need to 
implement our own test writer for these types. This is not the only uncovered 
case either: Parquet files with unsigned integer types are not covered, and 
there we probably can't even use the trick of testing through Spark, since 
unsigned types are unknown to Spark too.
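   
   For illustration, a minimal, untested sketch of such a self-contained writer 
using `ExampleParquetWriter` could look like the following (the schema, column 
name, output path and values are made up for the example, not taken from the PR):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.example.data.Group;
   import org.apache.parquet.example.data.simple.SimpleGroupFactory;
   import org.apache.parquet.hadoop.ParquetWriter;
   import org.apache.parquet.hadoop.example.ExampleParquetWriter;
   import org.apache.parquet.schema.LogicalTypeAnnotation;
   import org.apache.parquet.schema.MessageType;
   import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
   import org.apache.parquet.schema.Types;
   
   public class TimestampMillisWriterSketch {
     public static void main(String[] args) throws Exception {
       // event_time: required INT64 annotated as timestamp(isAdjustedToUTC = true, MILLIS),
       // with field id 1 so it can be mapped back to the Iceberg schema
       MessageType parquetSchema =
           Types.buildMessage()
               .required(PrimitiveTypeName.INT64)
               .as(LogicalTypeAnnotation.timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS))
               .id(1)
               .named("event_time")
               .named("test_schema");
   
       SimpleGroupFactory groups = new SimpleGroupFactory(parquetSchema);
   
       try (ParquetWriter<Group> writer =
           ExampleParquetWriter.builder(new Path("/tmp/parquet_timestamp_millis.parquet"))
               .withConf(new Configuration())
               .withType(parquetSchema)
               .build()) {
         // values are epoch milliseconds
         writer.write(groups.newGroup().append("event_time", 1_700_000_000_000L));
       }
     }
   }
   ```
   
   `ExampleParquetWriter` writes `Group` records directly against the Parquet 
`MessageType`, so no Spark classes are needed on the test path.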



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

