[GitHub] [pinot] mqliang opened a new issue #7913: Increase DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD or derive the value with some simple policy

GitBox Wed, 15 Dec 2021 21:51:47 -0800


mqliang opened a new issue #7913:
URL: https://github.com/apache/pinot/issues/7913



   For IN predicate, `ColumnValueSegmentPruner` prune the segments based on:
   * Column min/max value
   * Column bloom filter
   
   However, we will skip pruning when there are too many values in the IN 
predicate, currently we skip pruning when IN predicates has more than 10 items:
   
https://github.com/apache/pinot/blob/12edbdb1e521e0d9ee35a2d6ef2fed172e5beaa1/pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java#L287
   
   The value 10 here is quite empirical. IIUC, the rationale is: default FPP 
(false positive probability) of Bloom Filter is 
0.05:https://github.com/apache/pinot/blob/7e9ca6a5a4afe0d4e283ac1307c45430e474cbf2/pinot-spi/src/main/java/org/apache/pinot/spi/config/table/BloomFilterConfig.java#L28
   
   Which means even if a segments does not contains a given value,  BloomFilter 
will has a 5% probability falsely answering that the segment contains the 
value. 
   ```
   1 - 95% ^ 10 = 0.4
   ```
   So, if IN predicate has 10 values, even if a segment doe NOT contains any 
one of the 10 values, BloomFilter will has a 40% probability falsely answering 
that the segment can NOT be pruned. Let's say we have a query set:
   * all the queries are IN predicate queries
   * all the IN predicate has 10 values
   
   Say we have a table, the table does not contains any one of the 10 values of 
the IN predicate:
   
   * If the table only has 1 segment, the segment does not contains any one of 
the 10 values. BloomFilter will has a 40% probability falsely answering that 
the segment can NOT be pruned ---> We can see around 60% of our query super 
fast (as the segment get  pruned), and the other 40% query very slow (need load 
the segment into memory and do a full scanning)
   * If the table has 2 segments, then for each segments BloomFilter will has a 
40% probability falsely answering that the segment can NOT be pruned ---> we 
will only get 1 segments pruned --> it does not help too much in terms of 
decreasing query latency. 
   * If the table has more than 2 segments, it's very likely that at least 1 
segments can not be pruned --> it does not help too much in terms of reducing 
query latency. 
   
   I think that's the reason why BloomFilter does not help too much in terms of 
decreasing query latency when IN predicate has more than 10 values.
   
   But if we look at this problem from system resource usage point of view, a 
pinot segments is usually 100MB to 500MB, loading a segment form disk to memory 
and do a full segment scanning is much much more expensive than consulting 
BloomFilter (hundreds or thousands times expensive IMO).
   
   For example: If we assume loading a segment form disk to memory and do a 
full segment scanning is 100 times expensive than consulting BloomFilter, we 
should set DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD as 90 ( `1 - 95% ^ 90 = 
0.99`). Let me elaborate:
   * Say consulting BloomFilter 90 times takes 1K CPU cycles, and 1KB memory
   * Say loading a segment form disk to memory and do a full segment scanning 
need 100K CPU cycles, and 100KB memory. (100 times expensive than consulting 
BloomFilter)
   * Say we have 100 queries, each queries has IN predicate with 90 values. The 
table only has 1 segment and it does not contains any one of the 90 values at 
all.
   * If we enable BloomFilter pruning. We are able to prune the segment for 
only 1 query, for the other 99 queries, we are unable to prune. Total resource 
usage to executing the query: 1K + 99 * 1K + 99 * 100K =  10M CPU cycles and 
10MB memory 
   * If we disable BloomFilter pruning, total resource usage to executing the 
query: 100 * 100K = 10M CPU cycles and 10MB memory 
   * If  IN predicate has more than 90 values, enabling BloomFilter pruning 
will use more resource; 
   
   In conclusion, increasing `DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD` more 
than 10 does not help too much in terms of reducing query latency, but will be 
very helpful to decrease overall system resource usage.
   
   In addition, MinMaxRange pruning does not has the false positive problem as 
BloomFilter does, we may should always enabling MinMaxRange pruning.
   
   cc @mcvsubbu @siddharthteotia @GSharayu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [pinot] mqliang opened a new issue #7913: Increase DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD or derive the value with some simple policy

Reply via email to