richardstartin commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924715637


   I think it's important to discuss data distribution when considering bucket 
assignment by value. The Achilles heel of bucketing by value is skewed 
distributions, which might be produced by e.g. a bursty process leading to 
clustered timestamps, or maybe latencies in APM data, which tend to be 
multi-modal with huge outliers extending the range significantly (mean=1ms but 
p99.9=5s isn't unheard of). If you get, say, 80% of rows falling in a single 
bucket, evaluation time can't be logarithmic in the number of buckets, even if 
that limit holds for uniformly distributed data. You can ameliorate this for 
some distributions by using more buckets (but this means you need to merge more 
of them at query time), but it's always possible to construct some degenerate 
case where the buckets become imbalanced (consider artefacts of upstream 
processing where, e.g., null has been coalesced to zero and a sparse set of 
values becomes dense with zeroes). 
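
A rough sketch of the imbalance, assuming equal-width buckets over [min, max] (the helper names and the synthetic latency data are made up for illustration and are not taken from Pinot):

```python
from collections import Counter


def equal_width_bucket(value, lo, hi, num_buckets):
    """Assign value to one of num_buckets equal-width buckets over [lo, hi]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * num_buckets)
    return min(idx, num_buckets - 1)  # the maximum lands in the last bucket


def largest_bucket_share(values, num_buckets):
    """Fraction of rows that fall into the most populated bucket."""
    lo, hi = min(values), max(values)
    counts = Counter(equal_width_bucket(v, lo, hi, num_buckets) for v in values)
    return max(counts.values()) / len(values)


# Hypothetical latency-like data (ms): 990 fast requests around 1-2ms,
# plus 10 outliers stretching the range up to 5s.
latencies = [1.0 + (i % 10) * 0.1 for i in range(990)]
latencies += [500.0 * (i + 1) for i in range(10)]

# Uniformly spread values for comparison.
uniform = [float(i) for i in range(1000)]

uniform_share = largest_bucket_share(uniform, 16)
skewed_share_16 = largest_bucket_share(latencies, 16)
skewed_share_1024 = largest_bucket_share(latencies, 1024)

print(uniform_share, skewed_share_16, skewed_share_1024)
```

With the uniform data each of the 16 buckets holds roughly 1/16 of the rows, whereas with the skewed data the first bucket holds all 990 fast requests, and going from 16 to 1024 buckets doesn't change that because the outliers set the bucket width.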
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


