cccs-jc commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2017840320
This weekend I fixed the issue with the three row-group filters not working together. The results are quite impressive: the `OR` query now takes about 11 seconds instead of 396.

```
-----------------results-4CPU-OR-fixed------------------
('R:128MB', 'SRC', 11.399899959564209)
('R:128MB', 'DST', 0.7222678661346436)
('R:128MB', 'AND', 0.6006526947021484)
('R:128MB', 'OR', 11.477725505828857)
-----------------results-4CPU------------------
('R:128MB', 'SRC', 13.441139459609985)
('R:128MB', 'DST', 1.1408600807189941)
('R:128MB', 'AND', 0.9586172103881836)
('R:128MB', 'OR', 396.9800181388855)
```

What I did is implement a new `ParquetCombinedRowGroupFilter`, which takes the `ParquetMetricsRowGroupFilter`, `ParquetDictionaryRowGroupFilter`, and `ParquetBloomRowGroupFilter` and applies them like so:

```java
@Override
public <T> Boolean eq(BoundReference<T> ref, Literal<T> lit) {
  // The predicate might match only if every filter says it might match.
  return visitors.stream().allMatch(v -> v.eq(ref, lit) == ROWS_MIGHT_MATCH);
}
```

For every column, it tests the metrics, dictionary, and bloom filters in sequence. If all of them return `ROWS_MIGHT_MATCH`, then `shouldRead=True` is returned.

I still have to write a unit test to show that `OR` statements are now applied properly. I'll make a PR and we can compare notes.

@zhongyujiang I'll have a look at your PR today.
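To illustrate why combining the filters per predicate helps `OR`, here is a minimal, self-contained sketch. It does not use the Iceberg classes: `LeafFilter`, `knownValues`, `demoFilters`, `separateOr`, and `combinedOr` are hypothetical names, the toy filters simply remember which values a column contains, and it assumes the earlier behaviour was that each filter evaluated the whole expression on its own and the three results were ANDed.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CombinedRowGroupFilterSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  /** Hypothetical stand-in for the equality test of one row-group filter. */
  interface LeafFilter {
    boolean eq(String column, String value);
  }

  /** Toy filter: it can rule out a value only for columns it has evidence for. */
  static LeafFilter knownValues(Map<String, Set<String>> valuesByColumn) {
    return (column, value) -> {
      Set<String> values = valuesByColumn.get(column);
      if (values == null) {
        return ROWS_MIGHT_MATCH; // no evidence for this column, stay conservative
      }
      return values.contains(value) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
    };
  }

  /** Illustrative setup: one filter knows only about "src", the other only about "dst". */
  static List<LeafFilter> demoFilters() {
    LeafFilter dictionaryLike = knownValues(Map.of("src", Set.of("10.0.0.1")));
    LeafFilter bloomLike = knownValues(Map.of("dst", Set.of("10.0.0.2")));
    return List.of(dictionaryLike, bloomLike);
  }

  /** Assumed earlier behaviour: every filter evaluates the whole OR, results are ANDed. */
  static boolean separateOr(List<LeafFilter> filters, String c1, String v1, String c2, String v2) {
    return filters.stream().allMatch(f -> f.eq(c1, v1) || f.eq(c2, v2));
  }

  /** Combined behaviour: every leaf consults all filters first, then OR is applied. */
  static boolean combinedOr(List<LeafFilter> filters, String c1, String v1, String c2, String v2) {
    boolean left = filters.stream().allMatch(f -> f.eq(c1, v1) == ROWS_MIGHT_MATCH);
    boolean right = filters.stream().allMatch(f -> f.eq(c2, v2) == ROWS_MIGHT_MATCH);
    return left || right;
  }

  public static void main(String[] args) {
    List<LeafFilter> filters = demoFilters();
    // Query: src = '1.1.1.1' OR dst = '2.2.2.2' (neither value is in the row group)
    System.out.println("separate: " + separateOr(filters, "src", "1.1.1.1", "dst", "2.2.2.2"));
    System.out.println("combined: " + combinedOr(filters, "src", "1.1.1.1", "dst", "2.2.2.2"));
  }
}
```

In this toy setup, each filter on its own cannot rule out the `OR` because it has no evidence for one of the two columns, so the row group is still read. With the combined approach, each branch is ruled out by whichever filter has evidence for it, and the row group can be skipped.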