kumarUjjawal opened a new issue, #22754:
URL: https://github.com/apache/datafusion/issues/22754
### Describe the bug
SlidingDistinctSumAccumulator (the accumulator used for SUM(DISTINCT) in
bounded/sliding window frames, added in #16943) has two related null-handling
bugs:
1. It iterates the raw values buffer, ignoring the validity mask. Both
update_batch (datafusion/functions-aggregate/src/sum.rs:539-549) and
retract_batch (sum.rs:595-608) do for
&v in arr.values(), which yields the physical value at every slot,
including NULL slots. Arrow does not guarantee the contents of the values
buffer at null positions, so a NULL
input row inserts an arbitrary physical value (commonly 0, but kernels can
leave non-zero garbage) into the distinct-counts map and adds it to the running
sum on first occurrence.
With non-zeroed null slots the sum can be arbitrarily wrong; even with
zeroed slots, NULL rows pollute the distinct key set with a spurious 0 key.
2. evaluate never returns NULL. sum.rs:552-554 returns
ScalarValue::Int64(Some(self.sum)) unconditionally, so even after fixing (1), a
frame containing only NULL rows would return
0 instead of NULL. (SUM(DISTINCT) of zero non-null values must be NULL;
the empty-frame case is already handled by the caller's default_value, but a
non-empty frame of all NULLs
reaches evaluate.)
For comparison, the non-DISTINCT sliding path (SlidingSumAccumulator,
sum.rs:468-) handles both cases correctly via compute::sum, which respects the
validity mask.
### To Reproduce
```SQL
-- DISTINCT sliding sum: WRONG — ts=1 frame contains only NULL, returns 0
SELECT ts, sum(DISTINCT v) OVER (ORDER BY ts ROWS BETWEEN 1 PRECEDING AND
CURRENT ROW) AS s
FROM (VALUES (1, CAST(NULL AS BIGINT)), (2, 3)) t(ts, v);
-- +----+---+
-- | 1 | 0 | <-- should be NULL
-- | 2 | 3 |
-- +----+---+
-- Non-DISTINCT sliding sum: correct
SELECT ts, sum(v) OVER (ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT
ROW) AS s
FROM (VALUES (1, CAST(NULL AS BIGINT)), (2, 3)) t(ts, v);
-- +----+------+
-- | 1 | NULL |
-- | 2 | 3 |
-- +----+------+
-- Non-window DISTINCT sum: correct
SELECT sum(DISTINCT v) FROM (VALUES (CAST(NULL AS BIGINT))) t(v);
-- NULL
```
### Expected behavior
- NULL input rows are skipped by both update_batch and retract_batch
(consistent with SUM semantics and with the non-DISTINCT sliding accumulator).
- evaluate returns Int64(None) when no non-null value is in the frame
(e.g., when the counts map is empty).
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]