amogh-jahagirdar commented on code in PR #10090: URL: https://github.com/apache/iceberg/pull/10090#discussion_r1640408646
########## parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java: ########## @@ -116,48 +122,48 @@ private boolean eval( } } - return ExpressionVisitors.visitEvaluator(expr, this); + return ExpressionVisitors.visit(expr, this); } @Override - public Boolean alwaysTrue() { - return ROWS_MIGHT_MATCH; // all rows match + public Expression alwaysTrue() { + return ROWS_ALL_MATCH; // all rows match } @Override - public Boolean alwaysFalse() { + public Expression alwaysFalse() { return ROWS_CANNOT_MATCH; // all rows fail } @Override - public Boolean not(Boolean result) { - return !result; + public Expression not(Expression result) { + throw new UnsupportedOperationException("This path shouldn't be reached."); Review Comment: I'm not really following why this throws UnsupportedOperationException? Should it have already been rewritten by `RewriteNot` or something? ########## parquet/src/test/java/org/apache/iceberg/parquet/TestBloomRowGroupFilter.java: ########## @@ -1189,4 +1192,40 @@ public void testTransformFilter() { .as("Should read: filter contains non-reference evaluate as True") .isTrue(); } + + @Test + public void testParquetFindsResidual() { + // `string` col has no dict, this should be eliminated by bloom filter + Expression bloom = equal("string", "BINARY测试_301"); + // should be eliminated by dictionary filter + Expression dict = equal("no_stats", "a"); + // should be eliminated by bloom filter + Expression metric = greaterThan("id", INT_MAX_VALUE); + Expression expr = or(bloom, or(dict, metric)); + + Expression expected = Binder.bind(SCHEMA.asStruct(), Expressions.or(bloom, dict), true); + ParquetMetricsRowGroupFilter metricFilter = new ParquetMetricsRowGroupFilter(SCHEMA, expr); + Expression metricResidual = metricFilter.residualFor(parquetSchema, rowGroupMetadata); + assertThat(expected.isEquivalentTo(metricResidual)) + .as("Expected residual: %s, actual residual: %s", expected, metricResidual); + + expected = Binder.bind(SCHEMA.asStruct(), bloom, true); + ParquetDictionaryRowGroupFilter dictFilter = + new ParquetDictionaryRowGroupFilter(SCHEMA, metricResidual); + Expression dictResidual = + dictFilter.residualFor( + parquetSchema, rowGroupMetadata, reader.getDictionaryReader(rowGroupMetadata)); + + assertThat(expected.isEquivalentTo(dictResidual)) + .as("Expected residual: %s, actual residual: %s", expected, dictResidual); + + expected = Expressions.alwaysFalse(); + ParquetBloomRowGroupFilter bloomFilter = new ParquetBloomRowGroupFilter(SCHEMA, dictResidual); + Expression bloomResidual = + bloomFilter.residualFor( + parquetSchema, rowGroupMetadata, reader.getBloomFilterDataReader(rowGroupMetadata)); + Review Comment: It feels a bit odd to test the combined filter residual logic in this test method in `TestBloomRowGroupFilter`. See my comment above on a separate test class which encapsulates the logic, which should also make it easier for testing since we can then move this to a separate test class. As far as testing goes, here are the cases I think: 1.) We know the existing tests should cover the cases where an individual filter is always true or always false. 2.) So following 1, then we'd want the following tests for the combined tests: a.) Where the metrics filter has a residual that's always true/false b.) Where the metrics filter has a residual that's not true/false and the dictionary filter has one that is true/false. c.) Where the metrics filter has a residual that's not true/false, the dictionary filter does not have one that is true/false, and the bloom filter returns a residual that's not true/false. d.) Same as c but the bloom filter does return true/false ########## parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java: ########## @@ -264,4 +254,31 @@ private List<Map<ColumnPath, ColumnChunkMetaData>> getColumnChunkMetadataForRowG } return listBuilder.build(); } + + private Expression findsResidual( Review Comment: I think just the method name "residual" is sufficient. ########## parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java: ########## @@ -264,4 +254,31 @@ private List<Map<ColumnPath, ColumnChunkMetaData>> getColumnChunkMetadataForRowG } return listBuilder.build(); } + + private Expression findsResidual( + Expression expr, + Schema expectedSchema, + MessageType typeWithIds, + BlockMetaData rowGroup, + boolean caseSensitive) { + ParquetMetricsRowGroupFilter metricFilter = + new ParquetMetricsRowGroupFilter(expectedSchema, expr, caseSensitive); + Expression metricResidual = metricFilter.residualFor(typeWithIds, rowGroup); + if (metricResidual == Expressions.alwaysFalse() || metricResidual == Expressions.alwaysTrue()) { + return metricResidual; + } + + ParquetDictionaryRowGroupFilter dictFilter = + new ParquetDictionaryRowGroupFilter(expectedSchema, metricResidual, caseSensitive); + Expression dictResidual = + dictFilter.residualFor(typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup)); + if (dictResidual == Expressions.alwaysFalse() || dictResidual == Expressions.alwaysTrue()) { + return dictResidual; + } + + ParquetBloomRowGroupFilter bloomFilter = + new ParquetBloomRowGroupFilter(expectedSchema, dictResidual, caseSensitive); + return bloomFilter.residualFor( + typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup)); Review Comment: I feel like it would be helpful to define a `ParquetCombinedFilter` or something like that which encapsulates the metrics row group filter, the dictionary filter, and the bloom filter. It would also implement the residualFor API and It should also make it easier to test that class, and the combined behavior of these filters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org