[GitHub] [incubator-pinot] mcvsubbu commented on pull request #5872: Support for exact distinct count for non int data types

GitBox Thu, 20 Aug 2020 08:42:34 -0700


mcvsubbu commented on pull request #5872:
URL: https://github.com/apache/incubator-pinot/pull/5872#issuecomment-677743083



   > > @mayankshriv Currently the `DistinctCount` is not having the expected 
behavior of returning the exact distinct count because it is storing the 
`hashCode()` of the values instead of the actual values, and will return less 
than accurate result when hash collision happens. We are fixing this unexpected 
behavior in this PR, but that has performance overhead.
   > > `DistinctCountBitmap` will have the same behavior as the current 
`DistinctCount` (storing hash of the values) and similar or better performance. 
In case there are performance-sensitive use cases with `DistinctCount`, you 
might consider using `DistinctCountBitmap` instead. For 
non-performance-sensitive use cases, nothing need to be changed as 
`DistinctCount` will return the exact distinct count, which is the expected 
behavior.
   > 
   > @Jackie-Jiang I completely agree that this is a good change for the 
functionality. I am just apprehensive that it can potentially have negative 
impact on performance, deployment (different versions of broker/server).
   > 
   > It is not just latency performance, there's memory impact as well, right? 
There are likely various use cases that call `distinctCount` on string columns 
(in fact, that is the most common use case). Given that we are going to use 
String hashSets instead of Int hashSets, there's a memory penalty as well. This 
can be amplified for multi-tenant cases where a single distinct query on a 
string column can put memory pressure and adversely impact all tables on the 
node.
   > 
   > For a large deployment, it is not practical to identify all use cases 
which may get impacted, manually evaluate the impact for each one of them, and 
ask the clients to move to the bit-map based aggregation function. For this 
reason, I was proposing to either have this as a new aggregation function, or 
provide a runtime way of disabling the new behavior in case issues happen in 
production.
   > 
   > At the same time, I am fine to have this PR merged (as I fundamentally 
agree with the functional side of the change). However, I do want to call out 
that large deployments might likely need to invest in manually validating this 
PR for the various use cases they may have.
   
   +1 on this. Site facing use cases breaking after deployment is not something 
to look forward to.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] mcvsubbu commented on pull request #5872: Support for exact distinct count for non int data types

Reply via email to