aryan-212 opened a new pull request, #21388:
URL: https://github.com/apache/datafusion/pull/21388
The interpolation step assumes centroids represent clusters of multiple
points. But if the number of input rows is small (≤ the digest's `max_size` /
compression threshold), **no compression ever happens**: every centroid has
weight 1 and corresponds to exactly one input value.
In that regime, interpolation is not just unnecessary — it is actively
**wrong**. The t-digest interpolates between adjacent centroids based on where
the rank falls *inside* the centroid's weight, using half-deltas to neighbors.
When every centroid has weight 1, this produces values that drift away from any
actual data point.
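A minimal sketch of the effect, using a deliberately simplified interpolation scheme rather than DataFusion's actual t-digest code. The function names and the four sample values are hypothetical. Each unit-weight centroid is assumed to sit at the midpoint of its own rank interval, and the estimate is interpolated linearly between neighboring centroids:

```python
import math

def interpolated_quantile(sorted_values, q):
    """Simplified centroid interpolation, assuming every centroid has weight 1.

    Centroid i is treated as sitting at cumulative rank i + 0.5, so a target
    rank that falls between two centers yields a blend of two data points.
    """
    n = len(sorted_values)
    pos = q * n - 0.5  # position measured in units of centroid centers
    if pos <= 0:
        return float(sorted_values[0])
    if pos >= n - 1:
        return float(sorted_values[-1])
    lo = math.floor(pos)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[lo + 1] - sorted_values[lo])

def nearest_rank_quantile(sorted_values, q):
    """Nearest-rank quantile: always returns one of the input values."""
    n = len(sorted_values)
    rank = max(1, math.ceil(q * n))  # 1-based rank
    return sorted_values[rank - 1]

values = [1, 2, 10, 100]  # hypothetical input, already sorted
blended = interpolated_quantile(values, 0.6)
exact = nearest_rank_quantile(values, 0.6)
print(blended, exact)
```

Under these assumptions the interpolated estimate lands between `2` and `10`, a value that never appears in the input, while the nearest-rank result is the actual data point `10`.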
This is particularly surprising for users running small queries or unit
tests — they expect percentile functions on a handful of values to return one
of those values.
## Concrete Example
Let's take a small example from the TPC-DS schema:
```sql
select cc_sq_ft from call_center;
```
none | cc_sq_ft
-- | --
1 | 6144
2 | 6144
3 | 19345
4 | 21156
5 | 21156
6 | 22743
7 | 34643
8 | 42935
9 | 52514
10 | 65772
11 | 76815
12 | 84336
13 | 105138
14 | 119886
Now take a small `APPROX_PERCENTILE` query such as:
```sql
select approx_percentile(cc_sq_ft,0.85) from call_center limit 50
```
Here `0.85 * 14 = 11.9`, which rounds up to rank 12, so the output of the above
`APPROX_PERCENTILE` query should be the 12th value, `84336`. That is what we get
when we run the same query in Databricks:
<img width="1012" height="754" alt="Screenshot 2026-04-06 at 12 11 21 AM"
src="https://github.com/user-attachments/assets/00a158d5-ca96-4a0d-adc0-108bfae49214"
/>
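The rank arithmetic can be checked with a short nearest-rank computation over the 14 values in the table. This is only a sketch of the expected result on uncompressed input; it does not claim to reproduce how Databricks or DataFusion compute the percentile internally, and the variable names are illustrative:

```python
import math

# cc_sq_ft values from the table, already in ascending order
cc_sq_ft = [
    6144, 6144, 19345, 21156, 21156, 22743, 34643,
    42935, 52514, 65772, 76815, 84336, 105138, 119886,
]

q = 0.85
n = len(cc_sq_ft)
rank = math.ceil(q * n)           # 0.85 * 14 = 11.9, rounded up to 12
percentile = cc_sq_ft[rank - 1]   # convert the 1-based rank to a 0-based index
print(rank, percentile)
```

With 14 rows the rank is 12 and the selected value is `84336`, one of the actual inputs.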
But in DataFusion the same query returns:
<img width="1130" height="234" alt="Screenshot 2026-04-06 at 12 12 21 AM"
src="https://github.com/user-attachments/assets/baaf9634-3e54-4b4a-86b3-3d230dd64397"
/>
This PR aims to fix this.