lakshmanan-v opened a new pull request #7383:
URL: https://github.com/apache/pinot/pull/7383


   ## Description
   Adding DistinctCountHLLSketch and DistinctCountHLLPlusPlus to support the 
HLLSketch and HLLPlusPlus algorithms for improved accuracy.
   
   **Issue:** https://github.com/apache/pinot/issues/7014
   
   [HllSketch](http://datasketches.apache.org/docs/HLL/HLL.html) is based on 
Apache Datasketches. The following benchmark claims this implementation is 
better than HLLPlusPlus. 
https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html 
   
   - Added support for HLL++, as many developers are looking for HLLPlusPlus as 
well. In our unit tests, HLLPlusPlus gave answers close to the actual 
cardinality of the dataset.
   - These aggregate functions support both single-value and multi-value 
aggregations. Raw versions of both functions are added as well. 
   - Multiple Java implementations of HLLPlusPlus 
([paper](https://research.google/pubs/pub40671.pdf)) exist. The Clearspring 
library used for DistinctCountHLL offers 
[HLLPlusPlus](https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java)
 as well. 
   - Upgraded the datasketches library version.
   
   ## Upgrade Notes
   Does this PR prevent a zero down-time upgrade? (Assume upgrade order: 
Controller, Broker, Server, Minion)
   * [X] Yes (Please label as **<code>backward-incompat</code>**, and complete 
the section below on Release Notes)
   
   Does this PR fix a zero-downtime upgrade introduced earlier?
   * [ ] Yes (Please label this as **<code>backward-incompat</code>**, and 
complete the section below on Release Notes)
   
   Does this PR otherwise need attention when creating release notes? Things to 
consider:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   * [X] Yes (Please label this PR as **<code>release-notes</code>** and 
complete the section on Release Notes)
   ## Release Notes
   - New aggregate functions are introduced to support advanced HLL algorithms 
that improve the accuracy and speed of HyperLogLog-based distinct counts. 
   - DistinctCountHLLSketch supports the Apache DataSketches HLLSketch.
   - DistinctCountHLLPlusPlus supports the Google HLL++ implementation by 
Clearspring.
   - Both functions support single-value and multi-value columns and offer 
serialized raw aggregate values.
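
To illustrate why a serialized raw value is useful, here is a minimal, self-contained sketch of the HyperLogLog idea. It is a toy written for this note, not the DataSketches or Clearspring code and not the code in this PR; the class `ToyHll`, its methods, and the SplitMix64-style hash are all assumptions made for illustration. It shows that two sketches built separately can be shipped as bytes and merged by taking the per-register maximum, which yields the same registers as a single sketch built over all the data.

```java
import java.util.Arrays;

/** Toy HyperLogLog for illustration only (hypothetical; not a library API). */
final class ToyHll {
    private final int p;            // precision: number of index bits
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // max observed rank per register

    ToyHll(int p) {
        if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // SplitMix64-style mixer used as a stand-in 64-bit hash (an assumption).
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long h = mix(value);
        int idx = (int) (h >>> (64 - p));   // top p bits select the register
        long rest = h << p;                 // remaining bits, left-aligned
        // Rank = position of the first 1 bit in the remaining 64 - p bits.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    /** Merging is a per-register max, so partial results can be combined. */
    void merge(ToyHll other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < m; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);    // bias constant for m >= 128
        double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && zeros > 0) return m * Math.log((double) m / zeros);
        return raw;
    }

    /** "Raw" form in this toy: just a copy of the register array. */
    byte[] toBytes() { return registers.clone(); }

    static ToyHll fromBytes(int p, byte[] bytes) {
        ToyHll s = new ToyHll(p);
        if (bytes.length != s.m) throw new IllegalArgumentException("bad length");
        System.arraycopy(bytes, 0, s.registers, 0, s.m);
        return s;
    }

    public static void main(String[] args) {
        ToyHll a = new ToyHll(12);
        ToyHll b = new ToyHll(12);
        for (long i = 0; i < 60_000; i++) a.add(i);        // values 0..59999
        for (long i = 40_000; i < 100_000; i++) b.add(i);  // values 40000..99999
        // Ship both as bytes, then merge on the receiving side.
        ToyHll merged = ToyHll.fromBytes(12, a.toBytes());
        merged.merge(ToyHll.fromBytes(12, b.toBytes()));
        System.out.println("estimated distinct values: " + Math.round(merged.estimate()));
    }
}
```

The real libraries use more compact encodings and better bias correction than this toy, so their serialized forms and accuracy will differ.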
   
   ## Documentation
   A follow-up PR will update the documentation for the newly introduced 
aggregate functions. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


