[PR] feat: native collect_list / array_agg aggregate [datafusion-comet]

via GitHub Wed, 24 Jun 2026 09:27:03 -0700


andygrove opened a new pull request, #4720:
URL: https://github.com/apache/datafusion-comet/pull/4720


   ## Which issue does this PR close?
   
   Closes #2524.
   
   ## Rationale for this change
   
   `collect_list` and its alias `array_agg` are common aggregate functions that 
previously fell back to Spark, breaking native execution for many real 
workloads (notably plans that group rows into arrays before further 
processing). Adding native support keeps these queries on the Comet path.
   
   ## What changes are included in this PR?
   
   This change was scaffolded with the `implement-comet-expression` skill.
   
   - Add `CollectList` message to `expr.proto` and a new `collectList = 18` arm 
in the `AggExpr` oneof.
   - Add `CometCollectList` serde in `aggregates.scala` and register 
`classOf[CollectList] -> CometCollectList` in `QueryPlanSerde`.
   - Wire the native side to 
`datafusion_spark::function::aggregate::collect::SparkCollectList`, the 
upstream Spark-compatible accumulator (Spark 3.4 through 4.1 use `ignore_nulls 
= true` semantics that match it). No Comet-local Rust function is added.
   - Extend the Partial-mode `adjustOutputForNativeState` fix in 
`operators.scala` to cover `CollectList` (same pattern already used for 
`CollectSet`: native produces `ArrayType(elementType)` while Spark declares the 
buffer as `BinaryType`).
   - Mark `collect_list` and `array_agg` as supported in the user-guide 
expression page.
   - Add an audit entry under 
`docs/source/contributor-guide/expression-audits/agg_funcs.md` covering Spark 
3.4.3, 3.5.8, 4.0.1, 4.1.1.
   
   ## How are these changes tested?
   
   - New SQL fixture 
`spark/src/test/resources/sql-tests/expressions/aggregate/collect_list.sql` 
exercises ~30 queries across types (boolean, byte/short/int/bigint, 
float/double including NaN/Inf/-0, string, binary, decimal up to (38,0), date, 
timestamp, struct, nested array), GROUP BY, NULLs, all-NULL groups, empty 
tables, single-row, mixed aggregates, multiple `collect_list` columns, 
DISTINCT, HAVING, the `array_agg` alias, INT/BIGINT boundary values, and an 
SPARK-17641 null-filter regression. The fixture runs under `ConfigMatrix: 
parquet.enable.dictionary=false,true`, so each query executes twice.
   - All 21 aggregate `CometSqlFileTestSuite` tests pass (`./mvnw test 
-Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/aggregate" 
-Dtest=none -Pspark-3.5`), confirming no regression to `collect_set` or other 
aggregates.
   - `cargo clippy --all-targets --workspace -- -D warnings` is clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: native collect_list / array_agg aggregate [datafusion-comet]

Reply via email to