andygrove opened a new pull request, #4720: URL: https://github.com/apache/datafusion-comet/pull/4720
## Which issue does this PR close? Closes #2524. ## Rationale for this change `collect_list` and its alias `array_agg` are common aggregate functions that previously fell back to Spark, breaking native execution for many real workloads (notably plans that group rows into arrays before further processing). Adding native support keeps these queries on the Comet path. ## What changes are included in this PR? This change was scaffolded with the `implement-comet-expression` skill. - Add `CollectList` message to `expr.proto` and a new `collectList = 18` arm in the `AggExpr` oneof. - Add `CometCollectList` serde in `aggregates.scala` and register `classOf[CollectList] -> CometCollectList` in `QueryPlanSerde`. - Wire the native side to `datafusion_spark::function::aggregate::collect::SparkCollectList`, the upstream Spark-compatible accumulator (Spark 3.4 through 4.1 use `ignore_nulls = true` semantics that match it). No Comet-local Rust function is added. - Extend the Partial-mode `adjustOutputForNativeState` fix in `operators.scala` to cover `CollectList` (same pattern already used for `CollectSet`: native produces `ArrayType(elementType)` while Spark declares the buffer as `BinaryType`). - Mark `collect_list` and `array_agg` as supported in the user-guide expression page. - Add an audit entry under `docs/source/contributor-guide/expression-audits/agg_funcs.md` covering Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1. ## How are these changes tested? - New SQL fixture `spark/src/test/resources/sql-tests/expressions/aggregate/collect_list.sql` exercises ~30 queries across types (boolean, byte/short/int/bigint, float/double including NaN/Inf/-0, string, binary, decimal up to (38,0), date, timestamp, struct, nested array), GROUP BY, NULLs, all-NULL groups, empty tables, single-row, mixed aggregates, multiple `collect_list` columns, DISTINCT, HAVING, the `array_agg` alias, INT/BIGINT boundary values, and an SPARK-17641 null-filter regression. The fixture runs under `ConfigMatrix: parquet.enable.dictionary=false,true`, so each query executes twice. - All 21 aggregate `CometSqlFileTestSuite` tests pass (`./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/aggregate" -Dtest=none -Pspark-3.5`), confirming no regression to `collect_set` or other aggregates. - `cargo clippy --all-targets --workspace -- -D warnings` is clean. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
