cccs-jc commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016525638
@huaxingao You are absolutely correct; the issue arises also when combining the `statsFilter` with the `dictFilter`. It's essentially the same underlying problem. The crux of the issue lies in the evaluation of conditions such as `col1=1 OR col2=1`, where we're layering the complete evaluation performed by `statsFilter`, with the evaluation from `dictFilter`, with the evaluation of `bloomFilter`. Ideally, for every column, we would want to combine the information from stats, dict and bloom. This sounds like a "Chain of Responsibility." So instead of the current implementation: ```java statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter, caseSensitive); dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter, caseSensitive); bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter, caseSensitive); boolean shouldRead = filter == null || (statsFilter.shouldRead(typeWithIds, rowGroup) && dictFilter.shouldRead( typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup)) && bloomFilter.shouldRead( typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup))); ``` I propose the following: ```java statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter, caseSensitive); dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter, caseSensitive); bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter, caseSensitive); dictFilter.setDict(reader.getDictionaryReader(rowGroup)); bloomFilter.setBloom(reader.getBloomFilterDataReader(rowGroup)); // Chain statsFilter -> bloomFilter -> dictFilter bloomFilter.setNext(dictFilter); statsFilter.setNext(bloomFilter); boolean shouldRead = filter == null || statsFilter.shouldRead(typeWithIds, rowGroup); ``` In this setup, when the `statsFilter` evaluates an expression like `eq` and determines that `ROWS_CANNOT_MATCH`, it returns immediately. However, if it determines `ROWS_MIGHT_MATCH`, it passes on the request to the next filter (`bloomFilter`). The `bloomFilter` then tests against the bloom information. If it determines `ROWS_CANNOT_MATCH`, it returns immediately. But if it determines `ROWS_MIGHT_MATCH`, it then asks the next filter (`dictFilter`). This approach effectively combines all the information (metrics, bloom, dict) for a given `col1` and accurately determines whether `ROWS_CANNOT_MATCH` or `ROWS_MIGHT_MATCH` for `col1=1`. Basically the chain is like this: `statsFilter.eq(col1=1) -> bloomFilter.eq(col1=1) -> dictFilter.eq(col1=1)` It does the same with `col2=1`. Finally, the results for every column (`col1`, `col2`) can be combined using the `public Boolean or(Boolean leftResult, Boolean rightResult)` method. The `or` method does not need to be delegated since it does not tests against metrics, dict or bloom. Does this seem reasonable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org