cccs-jc opened a new issue, #10029:
URL: https://github.com/apache/iceberg/issues/10029

   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I'm testing a table of flow data with a schema of `SRC_IP long, DST_IP long`
   
   I did a thorough investigation and there really seems to be a problem with how blooms are used...
   
   Here I have done a search for a random IP value. I run it 10 times to get a precise average execution time.
   
   I use a 512MB target file size and vary the row group size from 128MB to 16MB. I also run one test with files that have no blooms in them. All files are z-ordered by SRC and DST IP.
   
   The `where` clauses are `SRC_IP=val`, `DST_IP=val`, `SRC_IP=val AND 
DST_IP=val`, `SRC_IP=val OR DST_IP=val`
   
   ```
   ('R:128MB', 'SRC', 4.070408272743225)
   ('R:64MB', 'SRC', 3.479648399353027)
   ('R:32MB', 'SRC', 7.69552972316742)
   ('R:16MB', 'SRC', 12.549865365028381)
   ('R:128MB no bloom', 'SRC', 17.634950709342956)
   ('R:128MB', 'DST', 5.119180655479431)
   ('R:64MB', 'DST', 5.0318292617797855)
   ('R:32MB', 'DST', 8.04975097179413)
   ('R:16MB', 'DST', 16.09592936038971)
   ('R:128MB no bloom', 'DST', 46.66901330947876)
   ('R:128MB', 'AND', 2.262153959274292)
   ('R:64MB', 'AND', 2.3894467115402223)
   ('R:32MB', 'AND', 4.230756330490112)
   ('R:16MB', 'AND', 8.178192615509033)
   ('R:128MB no bloom', 'AND', 6.790379118919373)
   ('R:128MB', 'OR', 72.06906585693359)
   ('R:64MB', 'OR', 34.00628011226654)
   ('R:32MB', 'OR', 25.374402904510497)
   ('R:16MB', 'OR', 24.86290364265442)
   ('R:128MB no bloom', 'OR', 101.38280701637268)
   ```
   As you can see, with 128MB row groups a SRC query takes on average 4.07 seconds and a DST query 5.11 seconds, but an OR takes on average 72 seconds. This contrasts with the no-bloom scenario, which takes 17 seconds, 46 seconds, and 101 seconds for the OR.
   
   
   Looking at the Iceberg code, I suspect the issue is the following:
   
   There are 3 types of expression evaluators:
   - `ParquetMetricsRowGroupFilter`
   - `ParquetDictionaryRowGroupFilter`
   - `ParquetBloomRowGroupFilter`
   
   The ReadConf applies these 3 evaluators like so:
   
https://github.com/apache/iceberg/blob/c9795fda7105789edc0d1f8a624ceb251dfd4a69/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L118
   
   ```java
   boolean shouldRead =
       filter == null
           || (statsFilter.shouldRead(typeWithIds, rowGroup)
               && dictFilter.shouldRead(
                   typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup))
               && bloomFilter.shouldRead(
                   typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup)));
   ```
   If any of the 3 evaluators says it's not possible for a particular value to be found in a file's row group, then that row group will be skipped.
   
   Let's say the `where` clause is `SRC_IP=1 OR DST_IP=1`
   
   And let's suppose both columns are dictionary encoded.
   
   The `dictFilter` will determine that the value 1 is neither in the dictionary of SRC_IP nor in the dictionary of DST_IP, so it returns `shouldRead=false`.
   
   However, suppose SRC_IP is dictionary encoded and DST_IP is not (it uses a bloom).
   
   Then when `dictFilter` evaluates `DST_IP=1`, it returns `shouldRead=true` because there is no dictionary for DST_IP, so it can't rule the value out.
   
   Conversely, when `bloomFilter` tests `SRC_IP=1` and determines there is no bloom on SRC_IP (it is dictionary encoded), it returns `shouldRead=true` because it too can't rule the value out.
   
   The results are combined by `ReadConf` into `shouldRead=true`, even though based on the dictionary of SRC_IP and the bloom on DST_IP we should have skipped the row group. But since the evaluation is done independently, neither evaluator can make that decision.
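   The independent evaluation can be sketched with plain booleans (a hypothetical model, not the actual Iceberg classes): each evaluator sees the whole `OR` expression but only has a pruning structure for one of the two columns, so each must answer `shouldRead=true`, and `ReadConf`'s AND of the two answers keeps the row group.

   ```java
   // Hypothetical model of the two evaluators for "SRC_IP = 1 OR DST_IP = 1".
   // SRC_IP is dictionary encoded (no bloom); DST_IP has a bloom (no dictionary).
   class RowGroupSkipDemo {
     // dictFilter: the SRC_IP dictionary proves the value is absent, but there is
     // no dictionary for DST_IP, so the OR as a whole might still match.
     static boolean dictShouldRead() {
       boolean srcMightMatch = false; // dictionary rules SRC_IP = 1 out
       boolean dstMightMatch = true;  // no dictionary -> cannot rule out
       return srcMightMatch || dstMightMatch; // true
     }

     // bloomFilter: the DST_IP bloom proves the value is absent, but there is no
     // bloom for SRC_IP, so again the OR as a whole might still match.
     static boolean bloomShouldRead() {
       boolean srcMightMatch = true;  // no bloom -> cannot rule out
       boolean dstMightMatch = false; // bloom rules DST_IP = 1 out
       return srcMightMatch || dstMightMatch; // true
     }

     // An evaluator that picks the right structure per column could skip it.
     static boolean combinedShouldRead() {
       boolean srcMightMatch = false; // SRC_IP's dictionary
       boolean dstMightMatch = false; // DST_IP's bloom
       return srcMightMatch || dstMightMatch; // false -> skip the row group
     }

     public static void main(String[] args) {
       // ReadConf ANDs the independent answers, so the row group is read.
       System.out.println(dictShouldRead() && bloomShouldRead()); // true
       System.out.println(combinedShouldRead());                  // false
     }
   }
   ```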
   
   To prove my hypothesis I created a new dataset where I set `write.parquet.dict-size-bytes` very low to make sure dictionary encoding is not used. Since the columns are not dictionary encoded, they use a bloom filter.
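   For reference, a table configuration for this "no dict" experiment could look roughly like the following. The property names come from Iceberg's table-properties documentation; the table name `flows` and the values are illustrative only.

   ```sql
   ALTER TABLE flows SET TBLPROPERTIES (
     'write.target-file-size-bytes' = '536870912',        -- 512 MB target files
     'write.parquet.row-group-size-bytes' = '134217728',  -- 128 MB row groups
     'write.parquet.dict-size-bytes' = '1',               -- effectively disables dictionary encoding
     'write.parquet.bloom-filter-enabled.column.SRC_IP' = 'true',
     'write.parquet.bloom-filter-enabled.column.DST_IP' = 'true'
   );
   ```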
   
   The results show that when both the SRC_IP and DST_IP columns use blooms, the `OR` condition is fast to evaluate: 10 seconds compared to 379 seconds.
   
   ```
   ('R:128MB no dict', 'SRC', 8.311380386352539)
   ('R:128MB', 'DST', 0.7780213356018066)
   ('R:128MB no dict', 'DST', 0.8464977741241455)
   ('R:128MB', 'AND', 0.5989606380462646)
   ('R:128MB no dict', 'AND', 0.7306745052337646)
   ('R:128MB', 'OR', 379.92671513557434)
   ('R:128MB no dict', 'OR', 10.23964548110962)
   ```
   
   I propose that `ParquetDictionaryRowGroupFilter` and `ParquetBloomRowGroupFilter` be combined into a single evaluator that tests every column, using either its dictionary or its bloom, whichever is present.
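   A minimal sketch of what such a combined evaluator could look like (names and signatures are hypothetical, not a proposed API): for each column referenced by the predicate, pick whichever structure exists, dictionary first, bloom otherwise, and decide the predicate's AND/OR structure in one place.

   ```java
   import java.util.Map;
   import java.util.Set;
   import java.util.function.Predicate;

   // Hypothetical combined row-group filter (all names made up for illustration).
   class CombinedRowGroupFilter {
     // dictionaries: column -> set of values in its dictionary (absent if the
     // column is not dictionary encoded); blooms: column -> membership test
     // (absent if no bloom was written for the column).
     static boolean mightMatch(String column, long value,
                               Map<String, Set<Long>> dictionaries,
                               Map<String, Predicate<Long>> blooms) {
       Set<Long> dict = dictionaries.get(column);
       if (dict != null) {
         return dict.contains(value); // dictionary membership is exact
       }
       Predicate<Long> bloom = blooms.get(column);
       if (bloom != null) {
         return bloom.test(value); // may false-positive, never false-negative
       }
       return true; // no pruning structure at all -> cannot rule the value out
     }

     // shouldRead for "SRC_IP = v OR DST_IP = v": because the OR is decided in
     // one place, either column's structure can contribute to skipping.
     static boolean shouldReadOr(long v,
                                 Map<String, Set<Long>> dictionaries,
                                 Map<String, Predicate<Long>> blooms) {
       return mightMatch("SRC_IP", v, dictionaries, blooms)
           || mightMatch("DST_IP", v, dictionaries, blooms);
     }
   }
   ```

   With SRC_IP's dictionary lacking the value and DST_IP's bloom rejecting it, this evaluator returns `false` and the row group is skipped, which the two independent evaluators above cannot conclude.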
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

