kosiew opened a new issue, #22820:
URL: https://github.com/apache/datafusion/issues/22820
## Summary
`SUM(DISTINCT)` over sliding/bounded window frames is routed through the
normal `SUM` type-dispatch path, but the current sliding distinct accumulator
only supports `Int64`. This can produce confusing user-visible failures for
other `SUM` return types that DataFusion otherwise supports, such as unsigned
integers, floats, decimals, and durations.
This issue should make the supported type set explicit and reject
unsupported sliding `SUM(DISTINCT)` types with a clear error at the earliest
practical layer. Generalizing support to more types can be handled later as a
separate feature.
## Context
Relevant code:
- `datafusion/functions-aggregate/src/sum.rs`
- `Sum::create_sliding_accumulator`
- `SlidingDistinctSumAccumulator::try_new`
Current shape:
```rust
fn create_sliding_accumulator(
&self,
args: AccumulatorArgs,
) -> Result<Box<dyn Accumulator>> {
if args.is_distinct {
macro_rules! helper_distinct {
($t:ty, $dt:expr) => {
Ok(Box::new(SlidingDistinctSumAccumulator::try_new(&$dt)?))
};
}
downcast_sum!(args, helper_distinct)
} else {
// non-distinct sliding sum path
}
}
```
`downcast_sum!` accepts the supported `SUM` return types, including
`UInt64`, `Float64`, decimals, durations, and `Int64`. However, the distinct
sliding accumulator rejects everything except `Int64`:
```rust
pub fn try_new(data_type: &DataType) -> Result<Self> {
// TODO support other numeric types
if *data_type != DataType::Int64 {
return exec_err!("SlidingDistinctSumAccumulator only supports
Int64");
}
// ...
}
```
This means the dispatch contract and implementation contract are not aligned.
## Problem
From the SQL/user perspective, `SUM` supports more than `BIGINT`/`Int64`.
For non-window aggregates and non-distinct sliding sums, supported numeric
types work through the normal `SUM` path.
For sliding `SUM(DISTINCT)`, the broad dispatch path appears to accept all
supported `SUM` return types, then fails only when constructing the
accumulator. The error is implementation-oriented and does not clearly describe
the unsupported operation.
Example cases to cover or reject clearly:
```sql
-- float
SELECT SUM(DISTINCT f) OVER (
ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) FROM t;
-- decimal
SELECT SUM(DISTINCT d) OVER (
ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) FROM t;
-- unsigned-like values after coercion to UInt64, if reachable from SQL/types
SELECT SUM(DISTINCT u) OVER (
ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) FROM t;
-- duration values, if supported by SQL coercion/test setup
SELECT SUM(DISTINCT duration_col) OVER (
ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) FROM t;
```
## Desired behavior
Gate unsupported sliding `SUM(DISTINCT)` types explicitly.
Expected behavior:
- `Int64` sliding `SUM(DISTINCT)` keeps working.
- Unsupported types fail before or during physical accumulator construction
with a clear capability error.
- The error message names the operation and type, for example:
```text
SUM(DISTINCT) over sliding window frames is only supported for Int64, got
Decimal128(20, 2)
```
- Type dispatch should not imply support for types that are immediately
rejected by `SlidingDistinctSumAccumulator::try_new`.
## Suggested approach
- Avoid routing distinct sliding accumulators through the broad
`downcast_sum!` helper unless all those types are actually supported.
- Add a small dedicated helper for sliding distinct sum support, currently
accepting only `DataType::Int64`.
- Use that helper from `Sum::create_sliding_accumulator` for
`args.is_distinct`.
- Keep the unsupported-type check close to `create_sliding_accumulator` or
another physical-expression layer so the failure is tied to the window
aggregate capability, not the accumulator internals.
- Leave full support for floats, decimals, unsigned integers, and durations
to a future feature issue.
## Why this matters
This is a user-facing contract mismatch:
- `SUM` advertises/coerces multiple numeric return types.
- `create_sliding_accumulator` currently dispatches through the broad
supported-type path.
- The concrete sliding distinct accumulator only implements `Int64`.
The mismatch can cause confusing failures and makes future maintenance
harder because support appears broader at one layer than it is at another.
## Suggested tests
Add SQLLogicTests under `datafusion/sqllogictest/test_files/window.slt` or a
focused existing window test file.
Coverage:
- `SUM(DISTINCT)` sliding frame over `BIGINT` still succeeds, including
NULLs and duplicates.
- Unsupported type returns the intended clear error. Prefer one
SQL-reachable type such as `DOUBLE` or `DECIMAL`.
- If useful, add a Rust unit test for unsupported types that are hard to
express through SQL, such as `UInt64` or `Duration`.
## Acceptance criteria
- The supported type set for sliding `SUM(DISTINCT)` is explicit.
- `Int64` behavior, including NULL/empty-frame behavior, remains correct.
- Unsupported types produce a clear operation-specific error.
- No unsupported type fails only because
`SlidingDistinctSumAccumulator::try_new` rejects it after broad dispatch.
- Regression coverage exists for both the supported `Int64` path and at
least one unsupported-type error.
## Scope
This is a scoped correctness/refactor issue. It should not generalize
sliding `SUM(DISTINCT)` to new types. It should only make the current
`Int64`-only support explicit and user-facing errors clearer.
## Related PR
#22755
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]