CuteChuanChuan commented on PR #21988:
URL: https://github.com/apache/datafusion/pull/21988#issuecomment-4366624297

   Hi @alamb, thanks for raising the concern. I researched how PostgreSQL and 
DuckDB handle this, and added Spark / BigQuery / ClickHouse for completeness 
since the question is really "what's the cross-engine norm." Summary up front, 
then docs + reproducible tests.
   
   ## TL;DR
   
   Across all five engines I checked, `median` (or its closest equivalent) over 
an integer column **never returns the input integer type** — it always returns 
a floating-point / decimal type that preserves the `.5` from averaging.
   
   | Engine     | Has `median()`? | Return type for integer input | Truncates 
`.5`? |
   
|------------|-----------------|-------------------------------|------------------|
   | DuckDB     | Yes             | `DOUBLE`                      | No          
     |
   | PostgreSQL | No (uses `percentile_cont(0.5)`) | `double precision`  | No   
            |
   | Spark      | Yes             | floating point (docs example: `(0),(10)` → 
`5.0`) | No |
   | BigQuery   | No (uses `PERCENTILE_CONT`) | INT64 not even in supported 
types — must CAST to
   `NUMERIC`/`BIGNUMERIC`/`FLOAT64` | N/A by design |
   | ClickHouse | Yes (alias for `quantile`) | `Float64` | No |
   
   ## Per-engine evidence
   
   ### DuckDB — 
[docs](https://duckdb.org/docs/current/sql/functions/aggregates#medianx)
   > `median(x)` — The middle value of the set. **For even value counts, 
quantitative values are averaged** and ordinal values return the lower value.
   > Formula: `quantile_cont(x, 0.5)`
   
   Reproducible (DuckDB 1.5.2):
   ```bash
   duckdb -c "SELECT median(v) AS m, typeof(median(v)) AS rtype \
   FROM (VALUES (CAST(-128 AS TINYINT)), (CAST(127 AS TINYINT))) AS t(v);"
   # m: -0.5    rtype: DOUBLE
   ```
   
   ### PostgreSQL — 
[docs](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE)
   
   PostgreSQL has no built-in median; the wiki 
([Aggregate_Median](https://wiki.postgresql.org/wiki/Aggregate_Median)) 
recommends percentile_cont(0.5) WITHIN
   GROUP (ORDER BY ...), which is documented as returning double precision.
   
   Reproducible (PostgreSQL 17.9):
   ```bash
   docker run -d --name pg17 -e POSTGRES_PASSWORD=test postgres:17 && sleep 3
   docker exec pg17 psql -U postgres -c "\
   SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS m, \
          pg_typeof(percentile_cont(0.5) WITHIN GROUP (ORDER BY v)) AS rtype \
   FROM (VALUES (CAST(-128 AS smallint)), (CAST(127 AS smallint))) AS t(v);"
   # m: -0.5    rtype: double precision
   ```
   
   ### Spark — 
[docs](https://spark.apache.org/docs/latest/api/sql/index.html#median)
   Docs example for median(col): SELECT median(col) FROM VALUES (0), (10) AS 
tab(col);
   -- 5.0
   Integer input (0), (10) returns 5.0 (floating point) — even when the result 
happens to be a whole number.
   
   ### BigQuery — 
[docs](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#percentile_cont)
   
   No built-in MEDIAN. PERCENTILE_CONT's Supported Argument Types are 
explicitly limited to NUMERIC, BIGNUMERIC, FLOAT64 — INT64 is not in the list. 
The return-type table never produces an integer type, so truncation is 
structurally impossible.
   
   ### ClickHouse — [median 
docs](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/median)
 / [quantile 
docs](https://clickhouse.com/docs/sql-reference/aggregate-functions/reference/quantile)
   
   ▎ The median* functions are the aliases for the corresponding quantile* 
functions.
   ▎ Returned value: Approximate quantile of the specified level. Float64 or 
Date or DateTime.
   
   Docs example with UInt32 input (1, 1, 2, 3) returns 1.5.
   
   
   ## On the downstream-disruption concern
   
   You're right that this is still a return-type breaking change, and the 
cross-engine evidence above tell us:
   1. Users coming from any of the five engines above already expect a 
floating-point return for them.
   2. Anyone relying on median(int_col) → int_col is depending on behavior that 
is non-portable across the rest of the ecosystem.
   3. Pure-DataFusion users who happen to have median(int_col) feeding a 
downstream integer-typed schema would still need to update their code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to