tdunning edited a comment on pull request #7069:
URL: https://github.com/apache/incubator-pinot/pull/7069#issuecomment-864294396


   Errors in estimation with the t-digest are usually phrased in terms of quantile error (that is, in `q`, not in the samples `x`). Errors in the sample space are notionally unbounded because the sample distribution can be stretched arbitrarily. In addition, t-digest is normally used to estimate the tails of a distribution, so what is controlled is the relative error `error(q) / min(q, 1-q)` rather than the absolute error. The result is that the error near the median can be orders of magnitude larger than the error near the extremes.
   
   For a uniform distribution, errors in the sample space are the same as errors in quantile space (up to scaling by the range).
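
   As a rough illustration of both points, here is a minimal sketch against the `com.tdunning.math.stats.TDigest` API that builds a digest over uniform samples and compares estimated quantiles to the exact empirical ones. The class name, sample size, seed, and probe values of `q` are arbitrary choices for illustration, not anything from this discussion.

```java
import com.tdunning.math.stats.TDigest;

import java.util.Arrays;
import java.util.Random;

public class QuantileErrorSketch {
    public static void main(String[] args) {
        Random rand = new Random(42);
        int n = 1_000_000;
        double[] data = new double[n];
        TDigest digest = TDigest.createMergingDigest(100);     // delta (compression) = 100
        for (int i = 0; i < n; i++) {
            data[i] = rand.nextDouble();                        // uniform on [0, 1]
            digest.add(data[i]);
        }
        Arrays.sort(data);

        for (double q : new double[] {0.0001, 0.01, 0.5, 0.99, 0.9999}) {
            double estimate = digest.quantile(q);
            double exact = data[(int) Math.min(n - 1, Math.floor(q * n))];  // exact empirical quantile
            double qError = Math.abs(estimate - exact);         // for uniform [0,1] data, x-error ~ q-error
            double relError = qError / Math.min(q, 1 - q);      // error relative to distance from the nearer tail
            System.out.printf("q=%.4f  abs=%.2e  rel=%.2e%n", q, qError, relError);
        }
    }
}
```

   The absolute error should be largest near `q = 0.5` while the relative error stays roughly flat, which is the behavior described above.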
   
   Regarding how the error scales with respect to `q`, the figure below shows that for small values of `delta` the error scales roughly with `1/delta^2` and then transitions to scaling roughly with `1/sqrt(delta)` ([see this article on t-digest](https://www.sciencedirect.com/science/article/pii/S2665963820300403) for more detailed information). The problem with this statement is that it doesn't really answer your question because it is, again, focused on extreme values of `q`.
   
   ![Scaling of error with compression factor 
delta](https://github.com/tdunning/t-digest/blob/main/docs/error-vs-compression.png?raw=true)
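
   If you want to see that trend locally rather than from the figure, here is a rough sketch that sweeps the compression setting over one data set. A single seed and a single probe `q` make this much noisier than the averaged results behind the figure and the article above, so treat it as a smoke test only; the class name and parameter choices are illustrative.

```java
import com.tdunning.math.stats.TDigest;

import java.util.Arrays;
import java.util.Random;

public class CompressionSweepSketch {
    public static void main(String[] args) {
        Random rand = new Random(1);
        int n = 500_000;
        double[] data = new double[n];
        for (int i = 0; i < n; i++) {
            data[i] = rand.nextDouble();
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);

        double q = 0.999;                                  // an extreme quantile, where t-digest concentrates accuracy
        double exact = sorted[(int) (q * n)];
        for (double delta : new double[] {20, 50, 100, 200, 500}) {
            TDigest digest = TDigest.createMergingDigest(delta);
            for (double x : data) {
                digest.add(x);
            }
            double error = Math.abs(digest.quantile(q) - exact);
            System.out.printf("delta=%5.0f  |error at q=%.3f| = %.3e%n", delta, q, error);
        }
    }
}
```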
   
   Errors near the median for well-behaved distributions like the uniform distribution should be roughly `2/delta`, given that there should be about `delta/2` centroids (`delta` here is what the t-digest code usually calls `compression`). The actual results will be somewhat better than this to the extent that interpolation works. Here is a figure that shows the max error of `quantile(x)` for selected values of `x` at various combinations of `delta` and number of samples. Note that these errors are relative to the exact empirical quantiles; errors in consistency between different t-digests could, conceivably, be double these numbers. Also, if you looked at a high-resolution scan of possible values of `x`, you would likely find bad cases right at the transition between 1 and 2 samples per centroid.
   
   ![Errors as a function of delta and number of 
samples](https://raw.githubusercontent.com/tdunning/t-digest/main/docs/max-error-uniform.png)
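
   To make the consistency point concrete (this is a sketch of the idea, not the test in this PR): with `delta = 50` the crude `2/delta` figure above is about `0.04` in `q`, and two digests built independently over halves of the same stream can disagree with each other by up to the sum of their individual errors. The class name and the even/odd split below are just illustrative assumptions.

```java
import com.tdunning.math.stats.TDigest;

import java.util.Arrays;
import java.util.Random;

public class MedianConsistencySketch {
    public static void main(String[] args) {
        Random rand = new Random(7);
        int n = 200_000;
        double[] data = new double[n];
        TDigest left = TDigest.createMergingDigest(50);    // delta = 50, as in the test discussed here
        TDigest right = TDigest.createMergingDigest(50);
        for (int i = 0; i < n; i++) {
            data[i] = rand.nextDouble();
            // split one stream across two digests, the way two segments/servers would each see part of it
            (i % 2 == 0 ? left : right).add(data[i]);
        }
        Arrays.sort(data);
        double exactMedian = data[n / 2];

        double errLeft = Math.abs(left.quantile(0.5) - exactMedian);
        double errRight = Math.abs(right.quantile(0.5) - exactMedian);
        double disagreement = Math.abs(left.quantile(0.5) - right.quantile(0.5));
        // the disagreement can approach errLeft + errRight, i.e. roughly double the single-digest error
        System.out.printf("error(left)=%.2e  error(right)=%.2e  disagreement=%.2e%n",
            errLeft, errRight, disagreement);
    }
}
```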
   
   Basically, `delta = 100` has *far* fewer outlier cases and is likely to be what you need. I note that this is the default value in most of your code; this particular test stands out a bit for using `delta = 50`.
   
   This last figure is generated from the file `median-error.csv` produced by the `com.tdunning.tdigest.quality.CompareKllTest#testMedianErrorVersusScale` test in the main branch of t-digest. The code that produces the figure can be found [in this gist](https://gist.github.com/tdunning/78a4a8ee17b48d35cd7ec496b2f0d817).

