lakshmanan-v opened a new pull request #7383: URL: https://github.com/apache/pinot/pull/7383
## Description

Adds DistinctCountHLLSketch and DistinctCountHLLPlusPlus to support the HLLSketch and HLLPlusPlus algorithms for improved accuracy.

**Issue:** https://github.com/apache/pinot/issues/7014

[HllSketch](http://datasketches.apache.org/docs/HLL/HLL.html) is based on Apache DataSketches. The following benchmark claims this implementation is better than HLLPlusPlus: https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html

- Added support for HLL++ as well, since many developers are looking for HLLPlusPlus. In our unit tests, HLLPlusPlus gave answers close to the actual cardinality of the dataset.
- These aggregate functions support both single-value and multi-value aggregations. Raw versions of both functions are added as well.
- Multiple Java implementations of HLLPlusPlus ([paper](https://research.google/pubs/pub40671.pdf)) exist. The Clearspring library used for DistinctCountHLL offers [HLLPlusPlus](https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java) as well.
- Upgraded the datasketches library version.

## Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
* [x] Yes (Please label as **<code>backward-incompat</code>**, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?
* [ ] Yes (Please label this as **<code>backward-incompat</code>**, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes?
Things to consider:
- New configuration options
- Deprecation of configurations
- Signature changes to public methods/interfaces
- New plugins added or old plugins removed

* [x] Yes (Please label this PR as **<code>release-notes</code>** and complete the section on Release Notes)

## Release Notes

- New aggregate functions support more advanced HLL algorithms, improving the accuracy and speed of HyperLogLog-based distinct counting.
- DistinctCountHLLSketch supports Apache DataSketches HLLSketch.
- DistinctCountHLLPlusPlus supports the Clearspring implementation of Google's HLL++.
- Both functions support single-value and multi-value columns and offer serialized raw aggregate values.

## Documentation

A follow-up PR will update the documentation for the newly introduced aggregate functions.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org