amogh-jahagirdar commented on code in PR #14081:
URL: https://github.com/apache/iceberg/pull/14081#discussion_r2361248234
##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -988,6 +1004,158 @@ public void testTransformFilter() {
.isTrue();
}
+ @TestTemplate
+ public void testVariantFilterNotNull() throws IOException {
+ assumeThat(format).isEqualTo(FileFormat.PARQUET);
+
+ Schema variantSchema =
+ new Schema(
+ required(1, "id", IntegerType.get()),
+ optional(2, "variant_field", Types.VariantType.get()));
+
+ File parquetFile = new File(tempDir, "test-variant" + System.nanoTime());
+
+ OutputFile outFile = Files.localOutput(parquetFile);
+ try (FileAppender<GenericRecord> appender =
+ Parquet.write(outFile)
+ .schema(variantSchema)
+ .createWriterFunc(GenericParquetWriter::create)
+ .build()) {
+
+ for (int i = 0; i < 10; i++) {
+ GenericRecord record = GenericRecord.create(variantSchema);
+ record.setField("id", i);
+
+ if (i % 2 == 0) {
+ VariantMetadata metadata = Variants.metadata("field");
+ ShreddedObject obj = Variants.object(metadata);
+ obj.put("field", Variants.of("value" + i));
+ Variant variant = Variant.of(metadata, obj);
+ record.setField("variant_field", variant);
+ }
+
Review Comment:
Can we have a helper that just takes in a list of records and writes the
Parquet file:
```java
private File writeParquetFile(String namePrefix, Schema schema,
List<GenericRecord> records) throws IOException {
File parquetFile = new File(tempDir, namePrefix + System.nanoTime());
OutputFile outFile = Files.localOutput(parquetFile);
try (FileAppender<GenericRecord> appender =
Parquet.write(outFile)
.schema(schema)
.createWriterFunc(GenericParquetWriter::create)
.build()) {
for (GenericRecord record : records) {
appender.add(record);
}
}
parquetFile.deleteOnExit(); // Optional: ensure cleanup
return parquetFile;
}
```
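For illustration, the test body could then shrink to something like the sketch below. This assumes the `writeParquetFile` helper above; the record-building logic just mirrors what's already in the diff, and `Lists` is the relocated Guava collection helper already used throughout these tests:
```java
// Sketch only: assumes the writeParquetFile helper suggested above.
// Record construction mirrors the existing test body in the diff.
List<GenericRecord> records = Lists.newArrayList();
for (int i = 0; i < 10; i++) {
  GenericRecord record = GenericRecord.create(variantSchema);
  record.setField("id", i);

  if (i % 2 == 0) {
    // Even ids get a variant value; odd ids leave variant_field null.
    VariantMetadata metadata = Variants.metadata("field");
    ShreddedObject obj = Variants.object(metadata);
    obj.put("field", Variants.of("value" + i));
    record.setField("variant_field", Variant.of(metadata, obj));
  }

  records.add(record);
}

File parquetFile = writeParquetFile("test-variant", variantSchema, records);
```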
##########
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:
##########
@@ -988,6 +993,97 @@ public void testTransformFilter() {
.isTrue();
}
+ @TestTemplate
+ public void testVariantFilterNotNull() throws IOException {
+ assumeThat(format).isEqualTo(FileFormat.PARQUET);
+
+ Schema variantSchema =
+ new Schema(
+ required(1, "id", IntegerType.get()),
+ optional(2, "variant_field", Types.VariantType.get()));
+
+ File parquetFile = new File(tempDir, "test-variant" + System.nanoTime());
+
+ OutputFile outFile = Files.localOutput(parquetFile);
+ try (FileAppender<GenericRecord> appender =
+ Parquet.write(outFile)
+ .schema(variantSchema)
+ .createWriterFunc(GenericParquetWriter::create)
+ .build()) {
+
+ for (int i = 0; i < 10; i++) {
+ GenericRecord record = GenericRecord.create(variantSchema);
+ record.setField("id", i);
+
+ if (i % 2 == 0) {
+ VariantMetadata metadata = Variants.metadata("field");
+ ShreddedObject obj = Variants.object(metadata);
+ obj.put("field", Variants.of("value" + i));
+ Variant variant = Variant.of(metadata, obj);
+ record.setField("variant_field", variant);
+ }
+
+ appender.add(record);
+ }
+ }
Review Comment:
Talked with @geruh offline, and I also poked around refactoring this class.
While I do think we should refactor (the mixture of ORC and Parquet tests both
trying to cover "row group filtering" leads to weird tests), it's a big
change, especially for something going into a patch release.

Also, while technically the implementation of variant filtering doesn't
depend on the actual contents of the file, after some more thought I concluded
it's better to write a more realistic test that does contain the records, like
@geruh was doing before.
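To make the "realistic test" concrete, a minimal sketch of the assertion side might look like the following. This is illustrative rather than the PR's actual code: it assumes the file written as in the diff, reads the row group metadata back with parquet-hadoop (`ParquetFileReader`/`HadoopInputFile`), and evaluates the filter via `ParquetMetricsRowGroupFilter`, the class under test here:
```java
// Minimal sketch, not the PR's actual code: evaluate the not-null filter
// against the single row group of the file written above.
try (ParquetFileReader reader =
    ParquetFileReader.open(
        HadoopInputFile.fromPath(
            new Path(parquetFile.toURI()), new Configuration()))) {
  MessageType fileSchema = reader.getFileMetaData().getSchema();
  BlockMetaData rowGroup = reader.getRowGroups().get(0);

  boolean shouldRead =
      new ParquetMetricsRowGroupFilter(
              variantSchema, Expressions.notNull("variant_field"))
          .shouldRead(fileSchema, rowGroup);

  // Half the rows carry a variant value, so the row group must be read.
  assertThat(shouldRead).isTrue();
}
```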