richardstartin commented on issue #7437: URL: https://github.com/apache/pinot/issues/7437#issuecomment-924715637
I think it's important to discuss data distribution when considering bucket assignment by value. The Achilles heel of bucketing by value is skewed distributions, which might be produced by, e.g., a bursty process leading to clustered timestamps, or by latencies in APM data, which tend to be multi-modal with huge outliers that extend the range significantly (mean=1ms but p99.9=5s isn't unheard of). If, say, 80% of rows fall in a single bucket, evaluation time can't be logarithmic in the number of buckets, even if that bound holds for uniformly distributed data. You can ameliorate this for some distributions by using more buckets (at the cost of merging more of them at query time), but it's always possible to construct a degenerate case where the buckets become imbalanced (consider artefacts of upstream processing where, for example, null has been collated to zero and a sparse set of values becomes dense with zeroes).
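
To make the imbalance concrete, here is a minimal sketch in Python. It assumes equal-width buckets over the observed min/max range, which is only one way to bucket by value and is not taken from Pinot's implementation; the helper names `bucket_of` and `largest_bucket_share` and the synthetic data are hypothetical. It compares the share of rows in the fullest bucket for uniform data against data where 80% of rows are zero.

```python
from collections import Counter


def bucket_of(value, lo, hi, num_buckets):
    """Map a value to an equal-width bucket index in [0, num_buckets)."""
    if hi == lo:
        return 0
    idx = int((value - lo) * num_buckets / (hi - lo))
    # The maximum value lands exactly on the upper edge; clamp it into the last bucket.
    return min(idx, num_buckets - 1)


def largest_bucket_share(values, num_buckets):
    """Fraction of all rows that fall into the single fullest bucket."""
    lo, hi = min(values), max(values)
    counts = Counter(bucket_of(v, lo, hi, num_buckets) for v in values)
    return max(counts.values()) / len(values)


# Hypothetical data: 10,000 evenly spread values.
uniform = list(range(10_000))

# Hypothetical data: 8,000 zeroes (e.g. nulls turned into zero upstream)
# plus 2,000 sparse values spread over a wide range.
skewed = [0] * 8_000 + list(range(0, 5_000_000, 2_500))

for num_buckets in (16, 1024):
    print(
        num_buckets,
        largest_bucket_share(uniform, num_buckets),
        largest_bucket_share(skewed, num_buckets),
    )
```

In this sketch the fullest bucket for the uniform data shrinks as buckets are added, while for the zero-heavy data it stays at or above 80% of rows no matter how many buckets are used, because all the zeroes share one value and so one bucket.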