[I] Incorrect approx_percentile / approx_median results for small datasets due to unnecessary t-digest interpolation [datafusion]

via GitHub Thu, 09 Apr 2026 23:27:53 -0700


aryan-212 opened a new issue, #21528:
URL: https://github.com/apache/datafusion/issues/21528


   ### Describe the bug
   
   DataFusion’s `approx_percentile_cont` and `approx_median` use a t-digest 
internally. However, when the number of input rows is small enough that no 
centroid compression occurs, each centroid represents exactly one data point 
(weight = 1).
   
   In this scenario, the t-digest interpolation step is still applied, even 
though it assumes centroids represent clusters of multiple points. This leads 
to incorrect results that:
   
   - Do not match exact continuous percentiles (percentile_cont)
   - Do not match expected discrete nearest-rank semantics (as in Databricks)
   - Drift away from actual observed values
   
   This behavior is particularly surprising for small datasets and unit tests, 
where users expect percentile outputs to correspond to real data points.
   
   ### To Reproduce
   
   ### Steps to reproduce the behavior:
   
   #### Create a small dataset (e.g., TPC-DS call_center table with ~14 rows)
   
   Run:
   
   ```sql
   select approx_percentile(cc_sq_ft, 0.85) from call_center;
   ```
   
   Observe the output
   
   Another example:
   
   ```sql
   select approx_median(value)
   from (values (-85), (-72), (-56), (-48), (-43), (-25), (-12), (-5), (45), 
(83)) t(value);
   ```
   
   ### Expected behavior
   
   For small datasets where no t-digest compression occurs:
   
   Results should follow nearest-rank semantics (e.g., ceil(q * n) - 1)
   Returned values should be actual observed data points
   Behavior should align with systems like Databricks (percentile_approx)
   
   Examples:
   
   - `approx_percentile(cc_sq_ft, 0.85)` → 84336
   - `approx_median(...)` → should return a valid input value (not interpolated 
like -32)
   
   ### Additional context
   
   Currently, DataFusion applies t-digest interpolation even when:
   
   ```
   number_of_centroids == number_of_input_rows
   ```
   
   This indicates no compression has occurred, making interpolation both 
unnecessary and incorrect.
   
   The issue stems from using the interpolation path in estimate_quantile 
regardless of centroid structure. In the no-compression regime, the algorithm 
should instead switch to an exact nearest-rank computation.
   
   This inconsistency leads to unintuitive and incorrect results, especially in:
   
   Small datasets
   Window functions with small frame sizes
   Unit tests expecting deterministic outputs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Incorrect approx_percentile / approx_median results for small datasets due to unnecessary t-digest interpolation [datafusion]

Reply via email to