amogh-jahagirdar commented on code in PR #9902:
URL: https://github.com/apache/iceberg/pull/9902#discussion_r1518451041
##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderWithBloomFilter.java:
##########
@@ -367,11 +374,28 @@ public void testReadWithFilter() {
             .filter(
                 "id = 250 AND id_long = 1250 AND id_double = 10250.0 AND id_float = 100250.0"
                     + " AND id_string = 'BINARY测试_250' AND id_boolean = true AND id_date = '2021-09-05'"
-                    + " AND id_int_decimal = 77.77 AND id_long_decimal = 88.88 AND id_fixed_decimal = 99.99");
+                    + " AND id_int_decimal = 77.77 AND id_long_decimal = 88.88 AND id_fixed_decimal = 99.99"
+                    + " AND id_nested.nested_id = 250");
     record = SparkValueConverter.convert(table.schema(), df.collectAsList().get(0));

     assertThat(df.collectAsList()).as("Table should contain 1 row").hasSize(1);
     assertThat(record.get(0)).as("Table should contain expected rows").isEqualTo(250);
   }
+
+  @TestTemplate
+  public void testBloomCreation() throws IOException {
+    org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(temp.toString());
+    ParquetMetadata parquetMetadata = ParquetFileReader.readFooter(new Configuration(), path);
+    for (int i = 0; i < 11; i++) {
+      if (useBloomFilter) {
+        assertThat(parquetMetadata.getBlocks().get(0).getColumns().get(0).getBloomFilterOffset())
+            .isNotEqualTo(-1L);
+      } else {
+        assertThat(parquetMetadata.getBlocks().get(0).getColumns().get(0).getBloomFilterOffset())
+            .isEqualTo(-1L);
+      }
+    }

Review Comment:
   I think this is great validation we should add, but in the Spark tests we should use the Spark APIs or Spark SQL to perform the write, and then run this validation to confirm the bloom filters exist. That should help catch the issue where bloom filters aren't being written for nested types when writing via Spark (writing directly through a `FileAppender` masks that).
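   Roughly, a sketch of what I mean (the helper name `assertBloomFiltersPresent` is just a placeholder; it assumes the rows were already written through Spark SQL or the DataFrame writer against this test class's existing `table`, and in practice the per-column check would probably need to be limited to the columns that actually have bloom filters enabled in the table properties):

   ```java
   import static org.assertj.core.api.Assertions.assertThat;

   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.io.CloseableIterable;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.metadata.BlockMetaData;
   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
   import org.apache.parquet.hadoop.metadata.ParquetMetadata;

   // Sketch only: run this after the rows have been written through Spark (e.g.
   // spark.sql("INSERT INTO ...") or a DataFrame write), not through a FileAppender,
   // so the footer check validates what the Spark write path actually produced.
   private void assertBloomFiltersPresent(Table table, boolean expectBloomFilter) throws IOException {
     try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
       for (FileScanTask task : tasks) {
         // Read the Parquet footer of each data file the write produced.
         ParquetMetadata footer =
             ParquetFileReader.readFooter(
                 new Configuration(), new Path(task.file().path().toString()));
         for (BlockMetaData block : footer.getBlocks()) {
           for (ColumnChunkMetaData column : block.getColumns()) {
             // Parquet reports -1 for a column chunk that has no bloom filter.
             // In the real test this loop would likely be restricted to the columns
             // with bloom filters enabled in the table properties.
             if (expectBloomFilter) {
               assertThat(column.getBloomFilterOffset()).isNotEqualTo(-1L);
             } else {
               assertThat(column.getBloomFilterOffset()).isEqualTo(-1L);
             }
           }
         }
       }
     }
   }
   ```

   Checking the files returned by `table.newScan().planFiles()` ties the assertion to the data files the Spark write actually committed, rather than to whatever happens to be sitting under `temp`.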