asolimando commented on issue #21528: URL: https://github.com/apache/datafusion/issues/21528#issuecomment-4222057059
I don't think this is necessarily a bug. SQL has two kinds of semantics for percentile-style aggregates, and the approximate UDFs follow the same split (see the Postgres [doc](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE) for the non-approximate functions):

- discrete: the output is guaranteed to be one of the input values,
- continuous: the output may be any intermediate value, obtained by interpolation.

The `_cont` suffix in [`approx_percentile_cont`](https://github.com/apache/datafusion/blob/4389f14e70daa859c8ec41f9f09437c0b8e2bb55/datafusion/functions-aggregate/src/approx_percentile_cont.rs#L78) explicitly mirrors SQL's `percentile_cont`, which interpolates, and `approx_median` is [an alias for `approx_percentile_cont(0.5)`](https://github.com/apache/datafusion/blob/4389f14e70daa859c8ec41f9f09437c0b8e2bb55/datafusion/functions-aggregate/src/approx_median.rs#L49), so the same argument applies.

Note also that `percentile_cont` behaves the same way, even though it is not an approximation and doesn't use t-digest at all: interpolation is part of the contract of these functions.

Switching between the two semantics based on the number of input values doesn't seem desirable: the semantics are part of the function's contract, independent of the values you feed to it.

You cite determinism, and I agree that's a property we must preserve, but it is orthogonal to discrete vs. continuous. I believe results are still deterministic even when compression kicks in, so unless you can reproduce a non-deterministic example, I'd leave that out of the current discussion.

What I suggest is opening an issue to introduce an `approx_percentile_disc` function and discussing the best way to keep the discrete behaviour even under approximation (possibly building on top of `approx_percentile_cont` with a post-processing step to match the discrete semantics).

WDYT?
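To make the discrete vs. continuous distinction concrete, here is a minimal Python sketch of the exact (non-approximate) definitions as described in the Postgres docs. The helper names only mirror the SQL functions for illustration; they are not DataFusion APIs, and this is not how DataFusion implements them.

```python
import math


def percentile_cont(values, fraction):
    """Continuous semantics: interpolate linearly between adjacent sorted values."""
    if not values or not 0.0 <= fraction <= 1.0:
        raise ValueError("need at least one value and a fraction in [0, 1]")
    ordered = sorted(values)
    rank = fraction * (len(ordered) - 1)  # fractional 0-based position
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def percentile_disc(values, fraction):
    """Discrete semantics: first sorted value whose position k/n reaches the fraction."""
    if not values or not 0.0 <= fraction <= 1.0:
        raise ValueError("need at least one value and a fraction in [0, 1]")
    ordered = sorted(values)
    k = max(1, math.ceil(fraction * len(ordered)))  # 1-based position
    return ordered[k - 1]


data = [1, 2, 3, 4]
print(percentile_cont(data, 0.5))  # 2.5, not an input value
print(percentile_disc(data, 0.5))  # 2, always an input value
```

Even with only four inputs and no approximation involved, the continuous variant returns a value that is not in the input, which is why the behaviour reads as part of the contract rather than as an artifact of t-digest.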
