mayankshriv commented on pull request #5872:
URL: https://github.com/apache/incubator-pinot/pull/5872#issuecomment-676999527


   > @mayankshriv Currently the `DistinctCount` is not having the expected 
behavior of returning the exact distinct count because it is storing the 
`hashCode()` of the values instead of the actual values, and will return less 
than accurate result when hash collision happens. We are fixing this unexpected 
behavior in this PR, but that has performance overhead.
   > `DistinctCountBitmap` will have the same behavior as the current 
`DistinctCount` (storing hash of the values) and similar or better performance. 
In case there are performance-sensitive use cases with `DistinctCount`, you 
might consider using `DistinctCountBitmap` instead. For 
non-performance-sensitive use cases, nothing need to be changed as 
`DistinctCount` will return the exact distinct count, which is the expected 
behavior.
   
   
   @Jackie-Jiang I completely agree that this is a good change for the 
functionality. I am just apprehensive that it can potentially have negative 
impact on performance, deployment (different versions of broker/server).
   
   It is not just latency performance, there's memory impact as well, right? 
There are likely various use cases that call `distinctCount` on string columns 
(in fact, that is the most common use case). Given that we are going to use 
String hashSets instead of Int hashSets, there's a memory penalty as well. This 
can be amplified for multi-tenant cases where a single distinct query on a 
string column can put memory pressure and adversely impact all tables on the 
node.
   
   For a large deployment, it is not practical to identify all use cases which 
may get impacted, manually evaluate the impact for each one of them, and ask 
the clients to move to the bit-map based aggregation function. For this reason, 
I was proposing to either have this as a new aggregation function, or provide a 
runtime way of disabling the new behavior in case issues happen in production.
   
   At the same time, I am fine to have this PR merged (as I fundamentally agree 
with the functional side of the change). However, I do want to call out that 
large deployments might likely need to invest in manually validating this PR 
for the various use cases they may have.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to