yashmayya opened a new pull request, #13835: URL: https://github.com/apache/pinot/pull/13835
- Currently, the star-tree index doesn’t support configurable inputs for aggregation functions.
- Taking the [DISTINCTCOUNTHLL](https://docs.pinot.apache.org/configuration-reference/functions/distinctcounthll) function as an example, this means that there’s no way to provide the `log2m` parameter value in a star-tree index configuration, so the star-tree is always created using the default value of 8. Furthermore, the parameter isn't taken into account at query time.
- So, if there is a star-tree index for a `DistinctCountHll` aggregation (with the default `log2m` value of 8) on column `col`, and a user makes a query like `select DISTINCTCOUNTHLL(col, 16)...`, the query will still use the star-tree index. In the best case, when the aggregation query can be served wholly from the index, the query returns incorrect results (with lower than desired accuracy). In the worst case, when that’s not possible and additional aggregation is required, the query fails with an error, since `HyperLogLog`s with different `log2m` values can’t be merged - see https://github.com/apache/pinot/issues/12839.
- This patch introduces a mechanism to configure the aggregation function parameters for a star-tree index, and a mechanism to match query-time aggregation functions to only the appropriate star-tree index.
- Unfortunately, this does require a lot of aggregation-function-specific logic to be introduced. For instance, `HyperLogLog`s with different `log2m` values can't be merged, as pointed out above. However, two instances of `HyperLogLogPlus` with the same `p` value but different `sp` values can be merged. Instances of `UltraLogLog` with different `p` values can be merged, instances of `TDigest` with different compression factors can be merged, and so on.
- The star-tree index configuration's `aggregationConfigs` section now optionally takes a `functionParameters` map, which gives a user-friendly way of configuring the star-tree index aggregation function parameters. For example:

```
{
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": [
        "d1"
      ],
      "aggregationConfigs": [
        {
          "columnName": "m1",
          "aggregationFunction": "DISTINCTCOUNTHLL",
          "functionParameters": {
            "log2m": 16
          }
        }
      ]
    }
  ]
}
```

```
{
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": [
        "d1"
      ],
      "aggregationConfigs": [
        {
          "columnName": "m1",
          "aggregationFunction": "DISTINCTCOUNTHLLPLUS",
          "functionParameters": {
            "p": 10,
            "sp": 20
          }
        }
      ]
    }
  ]
}
```

- It is now also possible to have multiple star-tree indexes for `DISTINCTCOUNTHLL` on the same column with different `log2m` values (and only the appropriate one will be used for every query).
- Appropriate default value handling has also been added, so that it isn't necessary to explicitly configure function parameters and so that older segments continue working as expected. Note that older segments with star-tree indexes that were erroneously being used for a query like `select DISTINCTCOUNTHLL(col, 16)...` will no longer be used for such queries; the index will only be used for queries like `select DISTINCTCOUNTHLL(col, 8)...` or `select DISTINCTCOUNTHLL(col)...`.
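
To illustrate the matching and default handling described for `DISTINCTCOUNTHLL`, here is a minimal, self-contained Java sketch. It is not the code in this patch: the class `StarTreeParamMatchSketch`, the `AggConfig` type, and the `findMatch` method are hypothetical names made up for the example. The only facts taken from the description are that `log2m` defaults to 8 and that an index should serve a query only when the `log2m` values agree.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only; all names here are hypothetical, not Pinot classes.
class StarTreeParamMatchSketch {
  // Default log2m for DISTINCTCOUNTHLL, as stated in the description.
  static final int DEFAULT_LOG2M = 8;

  // Hypothetical stand-in for one entry of "aggregationConfigs".
  static final class AggConfig {
    final String columnName;
    final String aggregationFunction;
    final Map<String, Integer> functionParameters;

    AggConfig(String columnName, String aggregationFunction, Map<String, Integer> functionParameters) {
      this.columnName = columnName;
      this.aggregationFunction = aggregationFunction;
      this.functionParameters = functionParameters;
    }

    // A missing "log2m" entry (e.g. an older segment) falls back to the default.
    int log2m() {
      return functionParameters.getOrDefault("log2m", DEFAULT_LOG2M);
    }
  }

  // Returns the first DISTINCTCOUNTHLL config on the column whose log2m equals
  // the query's log2m. A null queryLog2m models DISTINCTCOUNTHLL(col).
  static Optional<AggConfig> findMatch(List<AggConfig> configs, String column, Integer queryLog2m) {
    int effectiveLog2m = (queryLog2m != null) ? queryLog2m : DEFAULT_LOG2M;
    return configs.stream()
        .filter(c -> c.aggregationFunction.equalsIgnoreCase("DISTINCTCOUNTHLL"))
        .filter(c -> c.columnName.equals(column))
        .filter(c -> c.log2m() == effectiveLog2m)
        .findFirst();
  }

  public static void main(String[] args) {
    List<AggConfig> configs = List.of(
        new AggConfig("m1", "DISTINCTCOUNTHLL", Map.of()),             // implicit log2m = 8
        new AggConfig("m1", "DISTINCTCOUNTHLL", Map.of("log2m", 16)));

    // DISTINCTCOUNTHLL(m1) and DISTINCTCOUNTHLL(m1, 8) both resolve to the default index.
    System.out.println(findMatch(configs, "m1", null).map(AggConfig::log2m));
    System.out.println(findMatch(configs, "m1", 8).map(AggConfig::log2m));
    // DISTINCTCOUNTHLL(m1, 16) resolves to the explicitly configured index.
    System.out.println(findMatch(configs, "m1", 16).map(AggConfig::log2m));
    // DISTINCTCOUNTHLL(m1, 12) matches nothing, so no star-tree index is used.
    System.out.println(findMatch(configs, "m1", 12).isPresent());
  }
}
```

Strict equality is the right rule for `log2m` because `HyperLogLog`s with different `log2m` values can't be merged. Other functions would need their own rule (for example, only `p` has to agree for `HyperLogLogPlus`), which is why the matching logic ends up being function-specific.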