Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-26 Thread via GitHub
zhongyujiang commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2020699122 >~~If we use ParquetCombinedRowGroupFilter, for certain expressions, even if the metric filter evaluates to false, the dict filter will still be invoked, resulting in addition

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-26 Thread via GitHub
cccs-jc commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2020343574 @amogh-jahagirdar I'm going to apply this patch to our internal deployment of Iceberg 1.5 and will likely run with it for a while. At the same time I will create a PR to the

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-25 Thread via GitHub
amogh-jahagirdar commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2018962330 I've been following this thread and after thinking about the proposed solution and going through the code a bit more, I think @cccs-jc's approach is logically sound. This is

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-25 Thread via GitHub
cccs-jc commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2017840320 This weekend I fixed the issue with the three row-group filters not working together. The results are quite impressive: 11 seconds vs 396. ``` -results-4CPU-OR-

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-25 Thread via GitHub
zhongyujiang commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2017638219 @cccs-jc @huaxingao I've run into the same issue before. Because the three row-group filters cannot work together, some query expressions containing OR cannot filter data. I have d
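
The effect described in this thread can be illustrated with a small toy model. The sketch below is not Iceberg code: `ColumnChunk`, `stats_might_match`, `dict_might_match`, `should_read_independent`, and `should_read_combined` are hypothetical names, only two of the three filters (stats and dictionary) are modeled, and the "combined" evaluation is just one plausible reading of the idea of making the filters work together. It assumes a row group where `col1 = 1` can be ruled out only by min/max stats and `col2 = 1` only by the dictionary.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class ColumnChunk:
    """Toy stand-in for one column chunk's metadata in a row group."""
    min_value: int
    max_value: int
    # None means the chunk is not dictionary-encoded, so the dictionary
    # filter cannot rule anything out for it.
    dictionary: Optional[FrozenSet[int]] = None


def stats_might_match(chunk: ColumnChunk, value: int) -> bool:
    # Min/max stats can only say "definitely absent" or "maybe present".
    return chunk.min_value <= value <= chunk.max_value


def dict_might_match(chunk: ColumnChunk, value: int) -> bool:
    # Without a dictionary we must conservatively answer "maybe present".
    return chunk.dictionary is None or value in chunk.dictionary


Predicate = Tuple[str, int]  # (column name, value), meaning `column = value`


def should_read_independent(row_group: Dict[str, ColumnChunk],
                            or_predicates: List[Predicate]) -> bool:
    # Each filter evaluates the whole OR expression by itself,
    # and the per-filter answers are ANDed afterwards.
    stats = any(stats_might_match(row_group[c], v) for c, v in or_predicates)
    dicts = any(dict_might_match(row_group[c], v) for c, v in or_predicates)
    return stats and dicts


def should_read_combined(row_group: Dict[str, ColumnChunk],
                         or_predicates: List[Predicate]) -> bool:
    # All filters are consulted per predicate, then the predicates are ORed.
    return any(
        stats_might_match(row_group[c], v) and dict_might_match(row_group[c], v)
        for c, v in or_predicates
    )


row_group = {
    "col1": ColumnChunk(min_value=10, max_value=20),  # stats exclude 1
    "col2": ColumnChunk(min_value=0, max_value=5,
                        dictionary=frozenset({0, 5})),  # dictionary excludes 1
}
predicates = [("col1", 1), ("col2", 1)]

independent = should_read_independent(row_group, predicates)
combined = should_read_combined(row_group, predicates)
```

In this toy setup the independent evaluation still reads the row group, because each filter on its own sees one branch of the OR that it cannot exclude, while the per-predicate evaluation can skip it. The same reasoning would extend to a bloom filter as a third per-predicate check.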

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-23 Thread via GitHub
huaxingao commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016577139 @cccs-jc Thanks for your proposal! For filter `col1=1 || col2=1`, the current implementation is: ``` shouldRead = statsFilter(col1=1 || col2=1) && dictFilter(col1=1 ||

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-23 Thread via GitHub
cccs-jc commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016525638 @huaxingao You are absolutely correct; the issue arises also when combining the `statsFilter` with the `dictFilter`. It's essentially the same underlying problem. The crux o

Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-22 Thread via GitHub
huaxingao commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016304838 @cccs-jc Thanks a lot for your thorough investigation and analysis! The problem you described will also occur without a bloom filter. Let's use the where clause `col1=1 OR

[I] Bloom filter not properly leveraged when using an OR condition [iceberg]

2024-03-22 Thread via GitHub
cccs-jc opened a new issue, #10029: URL: https://github.com/apache/iceberg/issues/10029 ### Apache Iceberg version 1.4.3 ### Query engine Spark ### Please describe the bug 🐞 I'm testing a table of flow data with a schema of `SRC_IP long, DST_IP long`