Re: [PR] Add support for configuring aggregation function parameters in the star-tree index [pinot]

via GitHub Sun, 18 Aug 2024 05:29:50 -0700


yashmayya commented on code in PR #13835:
URL: https://github.com/apache/pinot/pull/13835#discussion_r1720962052



##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/AggregationFunctionType.java:
##########
@@ -344,6 +346,55 @@ public static AggregationFunctionType 
getAggregationFunctionType(String function
     }
   }
 
+  public static int 
compareFunctionParametersForStarTree(AggregationFunctionType functionType,
+      Map<String, Object> configuration1, Map<String, Object> configuration2) {
+    switch (functionType) {
+      case DISTINCTCOUNTHLL:
+      case DISTINCTCOUNTRAWHLL: {
+        // Compare the log2m value accounting for defaults
+        if ((MapUtils.isEmpty(configuration1) || 
!configuration1.containsKey(Constants.HLL_LOG2M_KEY))
+            && (MapUtils.isEmpty(configuration2) || 
!configuration2.containsKey(Constants.HLL_LOG2M_KEY))) {
+          return 0;
+        }
+        if ((MapUtils.isEmpty(configuration1) || 
!configuration1.containsKey(Constants.HLL_LOG2M_KEY))) {
+          return 
Integer.compare(CommonConstants.Helix.DEFAULT_HYPERLOGLOG_LOG2M, 
Integer.parseInt(String.valueOf(
+              configuration2.get(Constants.HLL_LOG2M_KEY))));
+        }
+        if ((MapUtils.isEmpty(configuration2)) || 
!configuration2.containsKey(Constants.HLL_LOG2M_KEY)) {
+          return 
Integer.compare(Integer.parseInt(String.valueOf(configuration1.get(Constants.HLL_LOG2M_KEY))),
+              CommonConstants.Helix.DEFAULT_HYPERLOGLOG_LOG2M);
+        }
+        return 
Integer.compare(Integer.parseInt(String.valueOf(configuration1.get(Constants.HLL_LOG2M_KEY))),
+            
Integer.parseInt(String.valueOf(configuration2.get(Constants.HLL_LOG2M_KEY))));
+      }
+
+      case DISTINCTCOUNTHLLPLUS:
+      case DISTINCTCOUNTRAWHLLPLUS: {
+        // Only p value needs to be compared; HyperLogLogPlus with any sp 
value can be merged as long as p value is the
+        // same
+        if ((MapUtils.isEmpty(configuration1) || 
!configuration1.containsKey(Constants.HLLPLUS_ULL_P_KEY))
+            && (MapUtils.isEmpty(configuration2) || 
!configuration2.containsKey(Constants.HLLPLUS_ULL_P_KEY))) {
+          return 0;
+        }
+        if ((MapUtils.isEmpty(configuration1) || 
!configuration1.containsKey(Constants.HLLPLUS_ULL_P_KEY))) {
+          return 
Integer.compare(CommonConstants.Helix.DEFAULT_HYPERLOGLOG_PLUS_P,
+              
Integer.parseInt(String.valueOf(configuration2.get(Constants.HLLPLUS_ULL_P_KEY))));
+        }
+        if ((MapUtils.isEmpty(configuration2) || 
!configuration2.containsKey(Constants.HLLPLUS_ULL_P_KEY))) {
+          return 
Integer.compare(Integer.parseInt(String.valueOf(configuration1.get(Constants.HLLPLUS_ULL_P_KEY))),
+              CommonConstants.Helix.DEFAULT_HYPERLOGLOG_PLUS_P);
+        }
+        return 
Integer.compare(Integer.parseInt(String.valueOf(configuration1.get(Constants.HLLPLUS_ULL_P_KEY))),
+            
Integer.parseInt(String.valueOf(configuration2.get(Constants.HLLPLUS_ULL_P_KEY))));
+      }
+
+      // By default, assume that the rest of the aggregation functions do not 
have relevant configurations to be checked
+      // for a star-tree index
+      default:
+        return 0;

Review Comment:
   While this is definitely true for some aggregation functions like `COUNT` / 
`MIN` / `MAX` / `SUM` / `AVG` / `MINMAXRANGE` / `DISTINCTCOUNTBITMAP` / 
`PERCENTILEEST`, it's a little more nuanced for certain other aggregation 
functions like `DISTINCTCOUNTULL` and `SUMPRECISION` where if there is a 
mismatch in the query time aggregation parameter value (different `p` value for 
`DISTINCTCOUNTULL`, different precision value for `SUMPRECISION`) and the 
star-tree index aggregation parameter value. In these cases, there will never 
be a query error since intermediate results with any parameter value can be 
merged but the result could have lower / higher precision than desired by the 
user. On second thought, I'm wondering if it might just be better to do a 
stricter parameter match for these aggregation functions too.



##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/StarTreeClusterIntegrationTest.java:
##########
@@ -234,6 +240,175 @@ private void testStarQuery(String starQuery, boolean 
verifyPlan)
         starQuery, starResponse, referenceQuery, referenceResponse, 
_randomSeed));
   }
 
+  @Test
+  public void testStarTreeWithDistinctCountHllConfigurations() throws 
Exception {
+    List<StarTreeIndexConfig> starTreeIndexConfigs = 
_tableConfig.getIndexingConfig().getStarTreeIndexConfigs();
+    StarTreeAggregationConfig aggregationConfig = new 
StarTreeAggregationConfig("OriginCityName", "DISTINCTCOUNTHLL",
+        Map.of(Constants.HLL_LOG2M_KEY, 4), null, null, null, null, null);
+
+    starTreeIndexConfigs.add(new 
StarTreeIndexConfig(Collections.singletonList("CRSDepTime"), null,
+        null, List.of(aggregationConfig), 1));
+    updateTableConfig(_tableConfig);
+
+    // Wait for table config to be updated
+    TestUtils.waitForCondition(
+        (aVoid) -> {
+          TableConfig tableConfig = getOfflineTableConfig();
+          return 
tableConfig.getIndexingConfig().getStarTreeIndexConfigs().size() == 
starTreeIndexConfigs.size();
+        }, 5_000L, "Failed to update table config"
+    );
+
+    reloadOfflineTable(DEFAULT_TABLE_NAME);
+
+    // Wait for the star-tree indexes to be built
+    TestUtils.waitForCondition(
+        (aVoid) -> {
+          JsonNode result;
+          try {
+            result = postQuery("EXPLAIN PLAN FOR SELECT 
DISTINCTCOUNTHLL(OriginCityName, 4) FROM mytable "
+                + "WHERE CRSDepTime = 35");
+          } catch (Exception e) {
+            throw new RuntimeException(e);
+          }
+          return result.toString().contains(FILTER_STARTREE_INDEX);
+        }, 1000L, 120_000L, "Failed to use star-tree index for query"

Review Comment:
   This is consistently failing in CI even with such a large timeout value 
whereas locally it consistently passes even with a timeout value as small as 10 
seconds. Looks like the segment reload and star-tree index building is taking 
excessively long in the CI environment 🤔 
   
   I'm not sure what the best way around this would be - potentially using a 
separate smaller table (lesser number of smaller segments) for this test 
perhaps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Re: [PR] Add support for configuring aggregation function parameters in the star-tree index [pinot]

Reply via email to