mayankshriv commented on pull request #5872: URL: https://github.com/apache/incubator-pinot/pull/5872#issuecomment-676999527
> @mayankshriv Currently the `DistinctCount` is not having the expected behavior of returning the exact distinct count because it is storing the `hashCode()` of the values instead of the actual values, and will return less than accurate result when hash collision happens. We are fixing this unexpected behavior in this PR, but that has performance overhead. > `DistinctCountBitmap` will have the same behavior as the current `DistinctCount` (storing hash of the values) and similar or better performance. In case there are performance-sensitive use cases with `DistinctCount`, you might consider using `DistinctCountBitmap` instead. For non-performance-sensitive use cases, nothing need to be changed as `DistinctCount` will return the exact distinct count, which is the expected behavior. @Jackie-Jiang I completely agree that this is a good change for the functionality. I am just apprehensive that it can potentially have negative impact on performance, deployment (different versions of broker/server). It is not just latency performance, there's memory impact as well, right? There are likely various use cases that call `distinctCount` on string columns (in fact, that is the most common use case). Given that we are going to use String hashSets instead of Int hashSets, there's a memory penalty as well. This can be amplified for multi-tenant cases where a single distinct query on a string column can put memory pressure and adversely impact all tables on the node. For a large deployment, it is not practical to identify all use cases which may get impacted, manually evaluate the impact for each one of them, and ask the clients to move to the bit-map based aggregation function. For this reason, I was proposing to either have this as a new aggregation function, or provide a runtime way of disabling the new behavior in case issues happen in production. At the same time, I am fine to have this PR merged (as I fundamentally agree with the functional side of the change). However, I do want to call out that large deployments might likely need to invest in manually validating this PR for the various use cases they may have. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org