cccs-jc opened a new issue, #10029: URL: https://github.com/apache/iceberg/issues/10029
### Apache Iceberg version

1.4.3

### Query engine

Spark

### Please describe the bug 🐞

I'm testing a table of flow data with the schema `SRC_IP long, DST_IP long`. After a thorough investigation, there really does seem to be a problem with how bloom filters are used.

Here I search for a random IP value, repeating each query 10 times to get a precise average execution time. I use a 512MB target file size and vary the row group size from 128MB down to 16MB. I also ran one test with files that have no bloom filters at all. All files are z-ordered by SRC_IP and DST_IP. The `where` clauses are `SRC_IP=val`, `DST_IP=val`, `SRC_IP=val AND DST_IP=val`, and `SRC_IP=val OR DST_IP=val`.

```
('R:128MB', 'SRC', 4.070408272743225)
('R:64MB', 'SRC', 3.479648399353027)
('R:32MB', 'SRC', 7.69552972316742)
('R:16MB', 'SRC', 12.549865365028381)
('R:128MB no bloom', 'SRC', 17.634950709342956)
('R:128MB', 'DST', 5.119180655479431)
('R:64MB', 'DST', 5.0318292617797855)
('R:32MB', 'DST', 8.04975097179413)
('R:16MB', 'DST', 16.09592936038971)
('R:128MB no bloom', 'DST', 46.66901330947876)
('R:128MB', 'AND', 2.262153959274292)
('R:64MB', 'AND', 2.3894467115402223)
('R:32MB', 'AND', 4.230756330490112)
('R:16MB', 'AND', 8.178192615509033)
('R:128MB no bloom', 'AND', 6.790379118919373)
('R:128MB', 'OR', 72.06906585693359)
('R:64MB', 'OR', 34.00628011226654)
('R:32MB', 'OR', 25.374402904510497)
('R:16MB', 'OR', 24.86290364265442)
('R:128MB no bloom', 'OR', 101.38280701637268)
```

As you can see, with 128MB row groups a SRC query takes on average 4.07 seconds and a DST query 5.11 seconds, but an OR query takes on average 72 seconds. Contrast this with the no-bloom scenario, where the same queries take 17 seconds, 46 seconds, and, for the OR, 101 seconds. Looking at the Iceberg code:
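For reference, the measurement procedure above can be sketched as a small Python harness. This is not the code I ran; `run_query` is a hypothetical callable standing in for executing one Spark SQL query against the table:

```python
import random
import time

def average_query_seconds(run_query, where_template, repeats=10):
    """Time a parameterized query `repeats` times and return the mean seconds.

    `run_query` is a hypothetical callable that executes one query;
    `where_template` is e.g. "SRC_IP={val} OR DST_IP={val}".
    """
    total = 0.0
    for _ in range(repeats):
        ip = random.randint(0, 2**32 - 1)  # random IPv4 address as a long
        start = time.time()
        run_query(where_template.format(val=ip))
        total += time.time() - start
    return total / repeats
```

Each `('R:...', 'SRC'|'DST'|'AND'|'OR', seconds)` tuple above is one such average for one row-group-size configuration and one `where`-clause shape.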
I suspect the issue is as follows. There are 3 types of expression evaluators:

- ParquetMetricsRowGroupFilter
- ParquetDictionaryRowGroupFilter
- ParquetBloomRowGroupFilter

`ReadConf` applies these 3 evaluators like so: https://github.com/apache/iceberg/blob/c9795fda7105789edc0d1f8a624ceb251dfd4a69/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L118

```
boolean shouldRead =
    filter == null
        || (statsFilter.shouldRead(typeWithIds, rowGroup)
            && dictFilter.shouldRead(
                typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup))
            && bloomFilter.shouldRead(
                typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup)));
```

If any of the 3 evaluators says it's not possible for a particular value to be found in a file's row group, then that row group is skipped.

Say the `where` clause is `SRC_IP=1 OR DST_IP=1`, and suppose both columns are dictionary encoded. The `dictFilter` will determine that the value 1 is in neither the dictionary of SRC_IP nor the dictionary of DST_IP, so it returns `shouldRead=false`.

However, suppose SRC_IP is dictionary encoded and DST_IP is not (it has a bloom filter instead). When the `dictFilter` evaluates `DST_IP=1`, it returns `shouldRead=true` because there is no dictionary, so it can't rule the value out. Conversely, when the `bloomFilter` tests `SRC_IP=1` and finds there is no bloom filter on SRC_IP (it is dictionary encoded), it returns `shouldRead=true` because again it can't rule the value out. `ReadConf` combines the two results and gets `shouldRead=true`, even though, based on the dictionary of SRC_IP and the bloom filter of DST_IP, we should have skipped the row group. Since the evaluation is done independently, neither evaluator can make that decision.

To prove my hypothesis I created a new dataset where I set `write.parquet.dict-size-bytes` very low, to make sure dictionary encoding is not used. Since the columns are not dictionary encoded, they both use a bloom filter.
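To make the failure mode concrete, here is a small self-contained Python sketch. These are toy stand-ins for the Iceberg filters, not real Iceberg code: both the dictionary and the bloom filter are modeled as plain sets (a real bloom filter can additionally produce false positives), and `None` means the column lacks that structure:

```python
# Toy model of row-group filtering for a flat OR predicate.
# predicate: list of (column, value) terms combined with OR.
# row_group[col] = {"dict": set | None, "bloom": set | None}

def dict_filter_should_read(row_group, predicate):
    # Mirrors ParquetDictionaryRowGroupFilter: a term it cannot evaluate
    # (no dictionary for that column) conservatively "may match".
    return any(
        value in row_group[col]["dict"] if row_group[col]["dict"] is not None else True
        for col, value in predicate
    )

def bloom_filter_should_read(row_group, predicate):
    # Mirrors ParquetBloomRowGroupFilter: no bloom filter -> "may match".
    return any(
        value in row_group[col]["bloom"] if row_group[col]["bloom"] is not None else True
        for col, value in predicate
    )

def combined_should_read(row_group, predicate):
    # Sketch of the combined behavior: evaluate each term with
    # whichever structure the column actually has.
    def term_may_match(col, value):
        meta = row_group[col]
        if meta["dict"] is not None:
            return value in meta["dict"]
        if meta["bloom"] is not None:
            return value in meta["bloom"]
        return True  # nothing available to rule the value out
    return any(term_may_match(col, value) for col, value in predicate)

# SRC_IP is dictionary encoded, DST_IP only has a bloom filter;
# the value 1 appears in neither column of this row group.
row_group = {
    "SRC_IP": {"dict": {5, 9}, "bloom": None},
    "DST_IP": {"dict": None, "bloom": {7, 8}},
}
predicate = [("SRC_IP", 1), ("DST_IP", 1)]  # SRC_IP=1 OR DST_IP=1

independent = (dict_filter_should_read(row_group, predicate)
               and bloom_filter_should_read(row_group, predicate))
print(independent)                                 # True: row group read needlessly
print(combined_should_read(row_group, predicate))  # False: row group skipped
```

Each independent filter is forced to say "may match" for the term it cannot evaluate, which makes its OR result true; only a per-column combination can prune the row group.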
The results show that when both the SRC_IP and DST_IP columns use bloom filters, evaluating the `OR` condition is fast: 10 seconds compared to 379 seconds.

```
('R:128MB no dict', 'SRC', 8.311380386352539)
('R:128MB', 'DST', 0.7780213356018066)
('R:128MB no dict', 'DST', 0.8464977741241455)
('R:128MB', 'AND', 0.5989606380462646)
('R:128MB no dict', 'AND', 0.7306745052337646)
('R:128MB', 'OR', 379.92671513557434)
('R:128MB no dict', 'OR', 10.23964548110962)
```

I propose that the ParquetDictionaryRowGroupFilter and ParquetBloomRowGroupFilter be combined into a single evaluator that tests every column, using either its dictionary or its bloom filter, whichever is available.
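As a sketch of the proposed single evaluator (hypothetical names; a real implementation would be a new row-group filter visitor in Java), the idea is to walk the predicate tree once and prune each leaf with whichever structure its column has, so AND and OR both compose correctly:

```python
from dataclasses import dataclass

@dataclass
class Eq:          # leaf predicate: column == value
    col: str
    value: int

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

def may_read(expr, columns):
    """Return True if the row group may contain rows matching `expr`.

    `columns` maps column name -> {"dict": set | None, "bloom": set | None},
    a toy stand-in for the row group's dictionary pages and bloom filters.
    """
    if isinstance(expr, Eq):
        meta = columns[expr.col]
        if meta["dict"] is not None:   # exact: value must be in the dictionary
            return expr.value in meta["dict"]
        if meta["bloom"] is not None:  # approximate: real blooms can false-positive
            return expr.value in meta["bloom"]
        return True                    # no structure: cannot rule the value out
    if isinstance(expr, And):
        return may_read(expr.left, columns) and may_read(expr.right, columns)
    if isinstance(expr, Or):
        return may_read(expr.left, columns) or may_read(expr.right, columns)
    raise TypeError(f"unsupported expression: {expr!r}")

columns = {
    "SRC_IP": {"dict": {5, 9}, "bloom": None},  # dictionary encoded
    "DST_IP": {"dict": None, "bloom": {7, 8}},  # bloom filter only
}
# SRC_IP=1 OR DST_IP=1: neither structure can contain 1, so the single
# evaluator skips the row group where the two independent filters could not.
print(may_read(Or(Eq("SRC_IP", 1), Eq("DST_IP", 1)), columns))  # False
```

Because the one evaluator sees the whole expression, an OR only keeps the row group when at least one term survives its own column's dictionary or bloom check.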