zhongyujiang opened a new pull request, #6893: URL: https://github.com/apache/iceberg/pull/6893
We found that Parquet row-group filters sometimes do not work well, specifically when evaluating an OR expression whose child expressions can only be evaluated by different row-group filters.

For example, suppose we have a sorted column `foo` whose null values are all clustered together after sorting, so a query like `foo IS NULL` can filter out most of the data. But when we combine it with another condition, for example `bar IN (x, y, z) OR foo IS NULL` (column `bar` is not sorted), the row-group filters cannot prune effectively. This is because `ParquetMetricRowGroupFilter` is ineffective at evaluating `bar IN (x, y, z)`, while `ParquetDictionaryRowGroupFilter` cannot answer `foo IS NULL` because Parquet dictionaries carry no null statistics, so neither filter alone can rule out both branches of the OR. I guess the same thing happens when one child of the OR can only be answered by `ParquetBloomRowGroupFilter` and the other can only be answered by `ParquetMetricRowGroupFilter` or `ParquetDictionaryRowGroupFilter`.

This PR tries to solve this kind of issue. It borrows the idea of `ResidualEvaluator`: each row-group filter eliminates the predicates for which it can reach a ROWS_CANNOT_MATCH or ROWS_ALL_MATCH conclusion during evaluation, so that evaluating an expression yields a residual expression, which is then passed to the next row-group filter. In this way, the three row-group filters work together to evaluate an expression.
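The residual idea can be illustrated with a small toy model. This is only a sketch in Python and not the Iceberg implementation: the `Pred`, `Or`, `And`, `residual`, and `should_read` names are hypothetical, and each "filter" is reduced to a function that returns `True` (all rows match), `False` (no rows can match), or `None` (cannot decide) for a single predicate. The example row group is assumed to have no nulls in `foo` and a dictionary for `bar` that contains none of `x`, `y`, `z`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pred:
    name: str  # opaque predicate, e.g. "foo IS NULL"


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def residual(expr, decide):
    """Replace every predicate that `decide` can answer and simplify the rest."""
    if isinstance(expr, bool):
        return expr
    if isinstance(expr, Pred):
        verdict = decide(expr)
        return expr if verdict is None else verdict
    left = residual(expr.left, decide)
    right = residual(expr.right, decide)
    if isinstance(expr, Or):
        if left is True or right is True:
            return True
        if left is False:
            return right
        if right is False:
            return left
        return Or(left, right)
    # And
    if left is False or right is False:
        return False
    if left is True:
        return right
    if right is True:
        return left
    return And(left, right)


def should_read(expr, filters):
    """Chain the filters: each one only sees the residual left by the previous one."""
    for decide in filters:
        expr = residual(expr, decide)
        if isinstance(expr, bool):
            return expr  # False: skip the row group; True: all rows match
    return True  # still undecided, so the row group must be read


def metrics_filter(pred):
    # Assumption: column statistics report zero nulls for `foo` in this row group.
    return False if pred.name == "foo IS NULL" else None


def dictionary_filter(pred):
    # Assumption: the dictionary for `bar` contains none of x, y, z.
    return False if pred.name == "bar IN (x, y, z)" else None


expr = Or(Pred("bar IN (x, y, z)"), Pred("foo IS NULL"))

# Each filter evaluated on its own cannot rule out the row group.
independent = should_read(expr, [metrics_filter]) and should_read(expr, [dictionary_filter])
# Passing the residual from one filter to the next can.
chained = should_read(expr, [metrics_filter, dictionary_filter])

print(independent)  # True  (row group is read)
print(chained)      # False (row group is skipped)
```

In this sketch the metrics filter reduces the OR to the residual `bar IN (x, y, z)`, which the dictionary filter can then eliminate, so the row group is skipped even though neither filter could decide the full expression alone.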
