Re: [PR] Parquet: Make row-group filters cooperate to filter [iceberg]

via GitHub Sat, 15 Jun 2024 09:43:30 -0700


amogh-jahagirdar commented on code in PR #10090:
URL: https://github.com/apache/iceberg/pull/10090#discussion_r1640408646



##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:
##########
@@ -116,48 +122,48 @@ private boolean eval(
         }
       }
 
-      return ExpressionVisitors.visitEvaluator(expr, this);
+      return ExpressionVisitors.visit(expr, this);
     }
 
     @Override
-    public Boolean alwaysTrue() {
-      return ROWS_MIGHT_MATCH; // all rows match
+    public Expression alwaysTrue() {
+      return ROWS_ALL_MATCH; // all rows match
     }
 
     @Override
-    public Boolean alwaysFalse() {
+    public Expression alwaysFalse() {
       return ROWS_CANNOT_MATCH; // all rows fail
     }
 
     @Override
-    public Boolean not(Boolean result) {
-      return !result;
+    public Expression not(Expression result) {
+      throw new UnsupportedOperationException("This path shouldn't be 
reached.");

Review Comment:
   I'm not really following why this throws UnsupportedOperationException? 
Should it have already been rewritten by `RewriteNot` or something? 



##########
parquet/src/test/java/org/apache/iceberg/parquet/TestBloomRowGroupFilter.java:
##########
@@ -1189,4 +1192,40 @@ public void testTransformFilter() {
         .as("Should read: filter contains non-reference evaluate as True")
         .isTrue();
   }
+
+  @Test
+  public void testParquetFindsResidual() {
+    // `string` col has no dict, this should be eliminated by bloom filter
+    Expression bloom = equal("string", "BINARY测试_301");
+    // should be eliminated by dictionary filter
+    Expression dict = equal("no_stats", "a");
+    // should be eliminated by bloom filter
+    Expression metric = greaterThan("id", INT_MAX_VALUE);
+    Expression expr = or(bloom, or(dict, metric));
+
+    Expression expected = Binder.bind(SCHEMA.asStruct(), Expressions.or(bloom, 
dict), true);
+    ParquetMetricsRowGroupFilter metricFilter = new 
ParquetMetricsRowGroupFilter(SCHEMA, expr);
+    Expression metricResidual = metricFilter.residualFor(parquetSchema, 
rowGroupMetadata);
+    assertThat(expected.isEquivalentTo(metricResidual))
+        .as("Expected residual: %s, actual residual: %s", expected, 
metricResidual);
+
+    expected = Binder.bind(SCHEMA.asStruct(), bloom, true);
+    ParquetDictionaryRowGroupFilter dictFilter =
+        new ParquetDictionaryRowGroupFilter(SCHEMA, metricResidual);
+    Expression dictResidual =
+        dictFilter.residualFor(
+            parquetSchema, rowGroupMetadata, 
reader.getDictionaryReader(rowGroupMetadata));
+
+    assertThat(expected.isEquivalentTo(dictResidual))
+        .as("Expected residual: %s, actual residual: %s", expected, 
dictResidual);
+
+    expected = Expressions.alwaysFalse();
+    ParquetBloomRowGroupFilter bloomFilter = new 
ParquetBloomRowGroupFilter(SCHEMA, dictResidual);
+    Expression bloomResidual =
+        bloomFilter.residualFor(
+            parquetSchema, rowGroupMetadata, 
reader.getBloomFilterDataReader(rowGroupMetadata));
+

Review Comment:
   It feels a bit odd to test the combined filter residual logic in this test 
method in `TestBloomRowGroupFilter`. See my comment above on a separate test 
class which encapsulates the logic, which should also make it easier for 
testing since we can then move this to a separate test class.
   
   As far as testing goes, here are the cases I think:
   
   1.) We know the existing tests should cover the cases where an individual 
filter is always true or always false.
   2.) So following 1, then we'd want the following tests for the combined 
tests:
         a.) Where the metrics filter has a residual that's always true/false
         b.) Where the metrics filter has a residual that's not true/false and 
the dictionary filter has one that is true/false.
         c.) Where the metrics filter has a residual that's not true/false, the 
dictionary filter does not have one that is true/false, and the bloom filter 
returns a residual that's not true/false. 
         d.) Same as c but the bloom filter does return true/false



##########
parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java:
##########
@@ -264,4 +254,31 @@ private List<Map<ColumnPath, ColumnChunkMetaData>> 
getColumnChunkMetadataForRowG
     }
     return listBuilder.build();
   }
+
+  private Expression findsResidual(

Review Comment:
   I think just the method name "residual" is sufficient.



##########
parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java:
##########
@@ -264,4 +254,31 @@ private List<Map<ColumnPath, ColumnChunkMetaData>> 
getColumnChunkMetadataForRowG
     }
     return listBuilder.build();
   }
+
+  private Expression findsResidual(
+      Expression expr,
+      Schema expectedSchema,
+      MessageType typeWithIds,
+      BlockMetaData rowGroup,
+      boolean caseSensitive) {
+    ParquetMetricsRowGroupFilter metricFilter =
+        new ParquetMetricsRowGroupFilter(expectedSchema, expr, caseSensitive);
+    Expression metricResidual = metricFilter.residualFor(typeWithIds, 
rowGroup);
+    if (metricResidual == Expressions.alwaysFalse() || metricResidual == 
Expressions.alwaysTrue()) {
+      return metricResidual;
+    }
+
+    ParquetDictionaryRowGroupFilter dictFilter =
+        new ParquetDictionaryRowGroupFilter(expectedSchema, metricResidual, 
caseSensitive);
+    Expression dictResidual =
+        dictFilter.residualFor(typeWithIds, rowGroup, 
reader.getDictionaryReader(rowGroup));
+    if (dictResidual == Expressions.alwaysFalse() || dictResidual == 
Expressions.alwaysTrue()) {
+      return dictResidual;
+    }
+
+    ParquetBloomRowGroupFilter bloomFilter =
+        new ParquetBloomRowGroupFilter(expectedSchema, dictResidual, 
caseSensitive);
+    return bloomFilter.residualFor(
+        typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup));

Review Comment:
   I feel like it would be helpful to define a `ParquetCombinedFilter` or 
something like that which encapsulates the metrics row group filter, the 
dictionary filter, and the bloom filter. It would also implement the 
residualFor API and It should also make it easier to test that class, and the 
combined behavior of these filters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Parquet: Make row-group filters cooperate to filter [iceberg]

Reply via email to