This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f66c33618522 [SPARK-55501][SQL] Fix listagg distinct + within group
order by bug
f66c33618522 is described below
commit f66c3361852214d2adc35ed1dc4a1980057dba6f
Author: Helios He <[email protected]>
AuthorDate: Sat Feb 28 13:44:28 2026 +0800
[SPARK-55501][SQL] Fix listagg distinct + within group order by bug
### What changes were proposed in this pull request?
There is unexpected behaviour when the listagg expression uses DISTINCT
and ORDER BY together and the ORDER BY column is non-string/binary. The
ListAgg.child gets cast to a string type, so CheckAnalysis/the Resolver no
longer considers the child column semantically equal to the ORDER BY column.
This fails the query (the analyzer believes the listagg.child column is not
enough to determine the order, and could produce non-deterministic results).
This is not the expected behaviour, as the [...]
The fix I'm proposing loosens the restriction in the Analyzer/Resolver
check. We allow the listagg query to execute with DISTINCT + ORDER BY as long
as each child col value (under GROUP BY equality) maps to exactly one value of
the order expression col (they do not have to be the same; e.g. we could have
`order_col = child_col % 10`).
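The required dependency can be sketched outside Spark. A minimal Python illustration (a hypothetical helper, not part of this patch) checks whether each distinct child value maps to exactly one order value:

```python
# Hypothetical illustration of the loosened rule: LISTAGG(DISTINCT child)
# WITHIN GROUP (ORDER BY order_expr) is well-defined iff each distinct
# child value maps to exactly one order value (child -> order_expr).
def order_is_deterministic(rows):
    """rows: iterable of (child_value, order_value) pairs."""
    seen = {}
    for child, order in rows:
        if seen.setdefault(child, order) != order:
            return False  # one child value, two order values: ambiguous
    return True

# order determined by child (e.g. order_col = child_col % 10): allowed.
print(order_is_deterministic([(13, 3), (21, 1), (13, 3)]))  # True
# ORDER BY a column unrelated to child: ambiguous after deduplication.
print(order_is_deterministic([(13, 3), (13, 7)]))           # False
```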
This PR only implements the loosened restriction for the case where the
child col is a cast of the order col. Here we also need to handle the
reverse direction: the child col without the cast must be semantically equal
to the ORDER BY col, and the cast must be safe.
We follow these criteria to determine if a DataType can be safely cast to
StringType (no data type is implicitly cast to BinaryType, so we can ignore
that case). For two values a, b of DataType T:
1. if a = b under GROUP BY semantics, then CAST(a) = CAST(b)
2. if a != b under GROUP BY semantics, then CAST(a) != CAST(b)
We only consider data types that can be cast to string (e.g. we ignore
complex data types like Array, Struct, and Map).
The only two DataTypes that fail these criteria are DoubleType and
FloatType: 0.0 = -0.0 under GROUP BY, but CAST(0.0 AS STRING) = "0.0" !=
CAST(-0.0 AS STRING) = "-0.0". This is because Double/Float values are
normalized before GROUP BY, but not before casting.
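The Float/Double exception is easy to reproduce outside Spark; plain Python floats show the same mismatch between numeric equality and string form:

```python
# -0.0 and 0.0 are equal under numeric comparison (and under Spark's
# normalized GROUP BY semantics), but their string forms differ, so a
# cast-to-string before deduplication would split them into two values.
a, b = 0.0, -0.0
print(a == b)          # True
print(str(a), str(b))  # 0.0 -0.0
```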
Other numeric types are cast using `.toString()` or `toPlainString()`,
which preserve precision/scale. Datetime/interval types are converted without
loss.
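As a rough analogy for why DecimalType is safe (Spark uses Java's `BigDecimal.toPlainString()`; the Python `Decimal` below is only an illustration): a decimal column has a fixed scale, so GROUP BY-equal values render identically:

```python
from decimal import Decimal

# A Spark DecimalType(p, s) column has a fixed scale, so every value is
# rendered with the same number of fraction digits; GROUP BY-equal values
# therefore cast to identical strings. quantize() mimics the fixed scale.
scale = Decimal("0.01")             # scale 2, like DECIMAL(10, 2)
a = Decimal("1.1").quantize(scale)
b = Decimal("1.10").quantize(scale)
print(a == b)                       # True
print(str(a), str(b))               # 1.10 1.10
```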
### Why are the changes needed?
It's a bug, as explained above.
### Does this PR introduce _any_ user-facing change?
Yes. The previous behaviour resulted in an error:
Example query:
```
SELECT listagg(distinct col, ', ') within group (order by col)
FROM VALUES (3), (1), (100), (99), (1) t(col)
```
throws
`[INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT] Invalid
function `listagg` with WITHIN GROUP. The function is invoked with DISTINCT and
WITHIN GROUP but expressions "col" and "col" do not match. The WITHIN GROUP
ordering expression must be picked from the function inputs. SQLSTATE: 42K0K;`
I'm proposing that this query (and similar ones) now pass with the result
`1, 3, 99, 100`.
We also throw a new error for the case where the child col is a cast of
the order col and the cast is unsafe:
`[INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST]`
It is a user-facing change compared to the released Spark versions.
### How was this patch tested?
Unit tests added to `DataFrameAggregateSuite`.
### Was this patch authored or co-authored using generative AI tooling?
Co-authored.
Generated-by: Claude v2.1.39
Closes #54297 from helioshe4/listagg-distinct-within-group-bug.
Authored-by: Helios He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
---
.../src/main/resources/error/error-conditions.json | 5 +
.../sql/catalyst/analysis/CheckAnalysis.scala | 5 +-
.../resolver/AggregateExpressionResolver.scala | 18 +-
.../catalyst/expressions/aggregate/collect.scala | 157 ++++++++++++++++
.../spark/sql/errors/QueryCompilationErrors.scala | 12 ++
.../org/apache/spark/sql/internal/SQLConf.scala | 13 ++
.../analyzer-results/listagg-collations.sql.out | 69 +++++++
.../sql-tests/analyzer-results/listagg.sql.out | 207 ++++++++++++++++++++-
.../sql-tests/inputs/listagg-collations.sql | 16 ++
.../test/resources/sql-tests/inputs/listagg.sql | 28 ++-
.../sql-tests/results/listagg-collations.sql.out | 75 ++++++++
.../resources/sql-tests/results/listagg.sql.out | 201 +++++++++++++++++++-
.../apache/spark/sql/DataFrameAggregateSuite.scala | 25 +++
13 files changed, 801 insertions(+), 30 deletions(-)
diff --git a/common/utils/src/main/resources/error/error-conditions.json
b/common/utils/src/main/resources/error/error-conditions.json
index 52a21e43ea9e..b76e3b5c8d56 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -4296,6 +4296,11 @@
"The function is invoked with DISTINCT and WITHIN GROUP but
expressions <funcArg> and <orderingExpr> do not match. The WITHIN GROUP
ordering expression must be picked from the function inputs."
]
},
+ "MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST" : {
+ "message" : [
+ "The function <funcName> with DISTINCT and WITHIN GROUP (ORDER BY)
is not supported for <inputType> input. Explicitly cast the input to <castType>
before passing it to both the function argument and ORDER BY expression."
+ ]
+ },
"WITHIN_GROUP_MISSING" : {
"message" : [
"WITHIN GROUP is required for the function."
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 7e134d7f23eb..c03ffa116716 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -449,9 +449,8 @@ trait CheckAnalysis extends LookupCatalog with
QueryErrorsBase with PlanToString
messageParameters = Map("funcName" -> toSQLExpr(w)))
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
- if agg.isDistinct && listAgg.needSaveOrderValue =>
- throw
QueryCompilationErrors.functionAndOrderExpressionMismatchError(
- listAgg.prettyName, listAgg.child, listAgg.orderExpressions)
+ if agg.isDistinct && listAgg.hasDistinctOrderAmbiguity =>
+ listAgg.throwDistinctOrderError()
case w: WindowExpression =>
WindowResolution.validateResolvedWindowExpression(w)
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala
index b194a4e44a9d..d98b699349c9 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala
@@ -78,8 +78,10 @@ class AggregateExpressionResolver(
* 2. Handle [[OuterReference]] in [[AggregateExpression]], if there are
any (see
* `handleOuterAggregateExpression`);
* - Validation:
- * 1. [[ListAgg]] is not allowed in DISTINCT aggregates if it contains
[[SortOrder]] different
- * from its child;
+ * 1. [[ListAgg]] is not allowed in DISTINCT aggregates if the order value
is ambiguous
+ * after deduplication. When
[[SQLConf.LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER]] is
+ * enabled, a mismatch is tolerated if the child value uniquely
determines the order
+ * value (see [[ListAgg.hasDistinctOrderAmbiguity]]);
* 2. Nested aggregate functions are not allowed;
* 3. Nondeterministic expressions in the subtree of a related aggregate
function are not
* allowed;
@@ -116,8 +118,8 @@ class AggregateExpressionResolver(
private def validateResolvedAggregateExpression(aggregateExpression:
AggregateExpression): Unit =
aggregateExpression match {
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
- if agg.isDistinct && listAgg.needSaveOrderValue =>
- throwFunctionAndOrderExpressionMismatchError(listAgg)
+ if agg.isDistinct && listAgg.hasDistinctOrderAmbiguity =>
+ listAgg.throwDistinctOrderError()
case _ =>
if (expressionResolutionContextStack.peek().hasAggregateExpressions) {
throwNestedAggregateFunction(aggregateExpression)
@@ -212,14 +214,6 @@ class AggregateExpressionResolver(
}
}
- private def throwFunctionAndOrderExpressionMismatchError(listAgg: ListAgg) =
{
- throw QueryCompilationErrors.functionAndOrderExpressionMismatchError(
- listAgg.prettyName,
- listAgg.child,
- listAgg.orderExpressions
- )
- }
-
private def throwNestedAggregateFunction(aggregateExpression:
AggregateExpression): Nothing = {
throw new AnalysisException(
errorClass = "NESTED_AGGREGATE_FUNCTION",
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala
index a864bf108436..8f6b80ebbe77 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala
@@ -21,6 +21,7 @@ import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, Growable}
import scala.util.{Left, Right}
+import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch,
TypeCheckSuccess}
@@ -31,6 +32,7 @@ import org.apache.spark.sql.catalyst.util.{ArrayData,
GenericArrayData, TypeUtil
import org.apache.spark.sql.catalyst.util.TypeUtils.toSQLExpr
import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryErrorsBase}
import org.apache.spark.sql.errors.DataTypeErrors.{toSQLId, toSQLType}
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.types.StringTypeWithCollation
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
@@ -592,6 +594,161 @@ case class ListAgg(
false
}
+ /**
+ * Returns true if the order value may be ambiguous after DISTINCT
deduplication.
+ *
+ * For LISTAGG(DISTINCT child) WITHIN GROUP (ORDER BY order_expr),
correctness requires
+ * a functional dependency (child -> order_expr, where equality is defined
by GROUP BY
+ * semantics): each distinct child value must map to exactly one order
value. Otherwise,
+ * after deduplication on child, the order value is ambiguous.
+ *
+ * When child = Cast(order_expr, T) where T is STRING or BINARY (LISTAGG's
accepted input
+ * types), the functional dependency (child -> order_expr) is satisfied
since casting to
+ * string/binary is injective for the types we allow. However, the cast must
also be
+ * equality-preserving (order_expr -> child): GROUP BY-equal order values
must produce
+ * GROUP BY-equal child values. Otherwise, the DISTINCT rewrite (which
groups by child) may
+ * split values that should be in the same group, causing over-counting.
+ * For example, Float/Double violate this because -0.0 and 0.0 are GROUP
BY-equal but cast
+ * to different strings. This is checked by [[isCastEqualityPreserving]] and
+ * [[isCastTargetEqualityPreserving]].
+ *
+ * Currently only detects these conditions for Cast.
+ * TODO(SPARK-55718): extend to detect other functional dependencies.
+ *
+ * Returns false when the order expression matches the child (i.e.,
[[needSaveOrderValue]]
+ * is false). Otherwise, the behavior depends on the
+ * [[SQLConf.LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER]] config:
+ * - If enabled, delegates to [[checkOrderValueDeterminism]] to determine
whether the
+ * order value is uniquely determined by the child.
+ * - If disabled, any mismatch is considered ambiguous.
+ *
+ * @return true if ambiguity exists, false if the order value is
deterministic
+ * @see [[throwDistinctOrderError]] to throw the appropriate error when this
returns true
+ */
+ def hasDistinctOrderAmbiguity: Boolean = {
+ needSaveOrderValue && {
+ if (SQLConf.get.listaggAllowDistinctCastWithOrder) {
+ checkOrderValueDeterminism match {
+ case OrderDeterminismResult.Deterministic => false
+ case _ => true
+ }
+ } else {
+ true
+ }
+ }
+ }
+
+ def throwDistinctOrderError(): Nothing = {
+ if (SQLConf.get.listaggAllowDistinctCastWithOrder) {
+ checkOrderValueDeterminism match {
+ case OrderDeterminismResult.NonDeterministicMismatch =>
+ throwFunctionAndOrderExpressionMismatchError()
+ case OrderDeterminismResult.NonDeterministicCast(inputType, castType)
=>
+ throwFunctionAndOrderExpressionUnsafeCastError(inputType, castType)
+ case OrderDeterminismResult.Deterministic =>
+ throw SparkException.internalError(
+ "ListAgg.throwDistinctOrderError should not be called when the
cast is safe")
+ }
+ } else {
+ throwFunctionAndOrderExpressionMismatchError()
+ }
+ }
+
+ private def throwFunctionAndOrderExpressionMismatchError() = {
+ throw QueryCompilationErrors.functionAndOrderExpressionMismatchError(
+ prettyName, child, orderExpressions)
+ }
+
+ private def throwFunctionAndOrderExpressionUnsafeCastError(
+ inputType: DataType, castType: DataType) = {
+ throw QueryCompilationErrors.functionAndOrderExpressionUnsafeCastError(
+ prettyName, inputType, castType)
+ }
+
+ /**
+ * Checks whether the order value is uniquely determined by the child value.
+ *
+ * Currently only handles the case where child = Cast(order_expr, T). If the
cast is
+ * equality-preserving, the order value is deterministic (each child string
maps back to
+ * exactly one original value). Otherwise, returns
+ * [[OrderDeterminismResult.NonDeterministicMismatch]].
+ *
+ * @see [[hasDistinctOrderAmbiguity]]
+ */
+ private def checkOrderValueDeterminism: OrderDeterminismResult = {
+ if (orderExpressions.size != 1) return
OrderDeterminismResult.NonDeterministicMismatch
+ child match {
+ case Cast(castChild, castType, _, _)
+ if orderExpressions.head.child.semanticEquals(castChild) =>
+ if (isCastEqualityPreserving(castChild.dataType) &&
+ isCastTargetEqualityPreserving(castType)) {
+ OrderDeterminismResult.Deterministic
+ } else {
+ OrderDeterminismResult.NonDeterministicCast(castChild.dataType,
castType)
+ }
+ case _ => OrderDeterminismResult.NonDeterministicMismatch
+ }
+ }
+
+ /**
+ * Returns true if casting `dt` to string/binary preserves equality
semantics: values that
+ * are GROUP BY-equal must cast to equal results, and different results must
imply different
+ * original values. Types like Float/Double are unsafe because IEEE 754
negative zero (-0.0)
+ * and positive zero (0.0) are equal but produce different string
representations.
+ *
+ * @see [[checkOrderValueDeterminism]]
+ */
+ private def isCastEqualityPreserving(dt: DataType): Boolean = dt match {
+ case _: IntegerType | LongType | ShortType | ByteType => true
+ case _: DecimalType => true
+ case _: DateType | TimestampNTZType => true
+ case _: TimeType => true
+ case _: CalendarIntervalType => true
+ case _: YearMonthIntervalType => true
+ case _: DayTimeIntervalType => true
+ case BooleanType => true
+ case BinaryType => true
+ case st: StringType => st.isUTF8BinaryCollation
+ case _: DoubleType | FloatType => false
+ // During DST fall-back, two distinct UTC epochs can format to the same
local time string
+ // because the default format omits the timezone offset. TimestampNTZType
is safe (uses UTC).
+ case _: TimestampType => false
+ case _ => false
+ }
+
+ /**
+ * Returns true if the cast target type preserves equality semantics for
DISTINCT
+ * deduplication. A non-binary-equality collation on the target
[[StringType]] can cause
+ * different source values to become equal after casting (e.g., "ABC" and
"abc" are different
+ * under UTF8_BINARY but equal under UTF8_LCASE).
+ *
+ * @see [[checkOrderValueDeterminism]]
+ */
+ private def isCastTargetEqualityPreserving(dt: DataType): Boolean = dt match
{
+ case st: StringType => st.isUTF8BinaryCollation
+ case BinaryType => true
+ case _ => false
+ }
+
+ /**
+ * Result of checking whether the order value is uniquely determined by the
child value
+ * after DISTINCT deduplication. Currently only handles Cast.
+ */
+ private sealed trait OrderDeterminismResult
+
+ private object OrderDeterminismResult {
+ /** The order value is uniquely determined by the child value. */
+ case object Deterministic extends OrderDeterminismResult
+
+ /** Non-deterministic: cannot establish a child -> order functional
dependency. */
+ case object NonDeterministicMismatch extends OrderDeterminismResult
+
+ /** Non-deterministic: the cast does not preserve equality semantics. */
+ case class NonDeterministicCast(
+ inputType: DataType,
+ castType: DataType) extends OrderDeterminismResult
+ }
+
override protected def withNewChildrenInternal(newChildren:
IndexedSeq[Expression]): Expression =
copy(
child = newChildren.head,
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index a6174b82077b..8cdd734def4a 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -1116,6 +1116,18 @@ private[sql] object QueryCompilationErrors extends
QueryErrorsBase with Compilat
"orderingExpr" -> orderExpr.map(order =>
toSQLExpr(order.child)).mkString(", ")))
}
+ def functionAndOrderExpressionUnsafeCastError(
+ functionName: String,
+ inputType: DataType,
+ castType: DataType): Throwable = {
+ new AnalysisException(
+ errorClass =
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ messageParameters = Map(
+ "funcName" -> toSQLId(functionName),
+ "inputType" -> toSQLType(inputType),
+ "castType" -> toSQLType(castType)))
+ }
+
def wrongCommandForObjectTypeError(
operation: String,
requiredType: String,
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 1d59c7a08786..d5719a35cb36 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -6916,6 +6916,17 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER =
+ buildConf("spark.sql.listagg.allowDistinctCastWithOrder.enabled")
+ .internal()
+ .doc("When true, LISTAGG(DISTINCT expr) WITHIN GROUP (ORDER BY expr) is
allowed on " +
+ "non-string expr when the implicit cast to string preserves equality
(e.g., integer, " +
+ "decimal, date). When false, the function argument and ORDER BY
expression must have " +
+ "the exact same type, which requires explicit casts.")
+ .version("4.2.0")
+ .booleanConf
+ .createWithDefault(true)
+
/**
* Holds information about keys that have been deprecated.
*
@@ -8144,6 +8155,8 @@ class SQLConf extends Serializable with Logging with
SqlApiConf {
def isTimeTypeEnabled: Boolean = getConf(SQLConf.TIME_TYPE_ENABLED)
+ def listaggAllowDistinctCastWithOrder: Boolean =
getConf(LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER)
+
/** ********************** SQLConf functionality methods ************ */
/** Set Spark SQL configuration properties. */
diff --git
a/sql/core/src/test/resources/sql-tests/analyzer-results/listagg-collations.sql.out
b/sql/core/src/test/resources/sql-tests/analyzer-results/listagg-collations.sql.out
index 7a7d7aa3f047..ebf428339c7a 100644
---
a/sql/core/src/test/resources/sql-tests/analyzer-results/listagg-collations.sql.out
+++
b/sql/core/src/test/resources/sql-tests/analyzer-results/listagg-collations.sql.out
@@ -118,3 +118,72 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
"orderingExpr" : "\"collate(c1, utf8_binary)\""
}
}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), null, col#x ASC NULLS
FIRST, 0, 0) AS listagg(DISTINCT CAST(col AS STRING), NULL) WITHIN GROUP (ORDER
BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING COLLATE UTF8_LCASE\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"STRING\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES (X'414243'), (X'616263'), (X'414243') AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), null, col#x ASC NULLS
FIRST, 0, 0) AS listagg(DISTINCT CAST(col AS STRING), NULL) WITHIN GROUP (ORDER
BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES (X'414243'), (X'616263'), (X'414243') AS t(col)
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING COLLATE UTF8_LCASE\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"BINARY\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as binary), null, col#x ASC NULLS
FIRST, 0, 0) AS listagg(DISTINCT CAST(col AS BINARY), NULL) WITHIN GROUP (ORDER
BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
(SELECT col COLLATE UTF8_LCASE AS col FROM VALUES ('ABC'), ('abc'), ('ABC') AS
t(col))
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"BINARY\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"STRING COLLATE UTF8_LCASE\""
+ }
+}
diff --git
a/sql/core/src/test/resources/sql-tests/analyzer-results/listagg.sql.out
b/sql/core/src/test/resources/sql-tests/analyzer-results/listagg.sql.out
index 1d78d882cf6a..146aa9ff6f52 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/listagg.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/listagg.sql.out
@@ -272,7 +272,7 @@ Aggregate [listagg(col1#x, null, col2#x DESC NULLS LAST,
col1#x DESC NULLS LAST,
-- !query
-WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t
-- !query analysis
WithCTE
:- CTERelationDef xxxx, false
@@ -281,13 +281,13 @@ WithCTE
: +- Aggregate [listagg(col1#x, null, 0, 0) AS listagg(col1, NULL)#x]
: +- SubqueryAlias __auto_generated_subquery_name
: +- LocalRelation [col1#x]
-+- Project [len(col#x) AS len(col)#x, regexp_count(cast(col#x as string),
cast(0xDEAD as string)) AS regexp_count(col, X'DEAD')#x,
regexp_count(cast(col#x as string), cast(0xBEEF as string)) AS
regexp_count(col, X'BEEF')#x]
++- Project [len(col#x) AS len(col)#x, regexp_count(hex(col#x), hex(0xDEAD)) AS
regexp_count(hex(col), hex(X'DEAD'))#x, regexp_count(hex(col#x), hex(0xBEEF))
AS regexp_count(hex(col), hex(X'BEEF'))#x]
+- SubqueryAlias t
+- CTERelationRef xxxx, true, [col#x], false, false, 1
-- !query
-WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t
-- !query analysis
WithCTE
:- CTERelationDef xxxx, false
@@ -296,13 +296,13 @@ WithCTE
: +- Aggregate [listagg(col1#x, null, 0, 0) AS listagg(col1, NULL)#x]
: +- SubqueryAlias __auto_generated_subquery_name
: +- LocalRelation [col1#x]
-+- Project [len(col#x) AS len(col)#x, regexp_count(cast(col#x as string),
cast(0xDEAD as string)) AS regexp_count(col, X'DEAD')#x,
regexp_count(cast(col#x as string), cast(0xBEEF as string)) AS
regexp_count(col, X'BEEF')#x]
++- Project [len(col#x) AS len(col)#x, regexp_count(hex(col#x), hex(0xDEAD)) AS
regexp_count(hex(col), hex(X'DEAD'))#x, regexp_count(hex(col#x), hex(0xBEEF))
AS regexp_count(hex(col), hex(X'BEEF'))#x]
+- SubqueryAlias t
+- CTERelationRef xxxx, true, [col#x], false, false, 1
-- !query
-WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(col, X'42'), regexp_count(col,
X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(hex(col), hex(X'42')),
regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col), hex(X'BEEF')) FROM
t
-- !query analysis
WithCTE
:- CTERelationDef xxxx, false
@@ -311,7 +311,7 @@ WithCTE
: +- Aggregate [listagg(col1#x, 0x42, 0, 0) AS listagg(col1, X'42')#x]
: +- SubqueryAlias __auto_generated_subquery_name
: +- LocalRelation [col1#x]
-+- Project [len(col#x) AS len(col)#x, regexp_count(cast(col#x as string),
cast(0x42 as string)) AS regexp_count(col, X'42')#x, regexp_count(cast(col#x as
string), cast(0xDEAD as string)) AS regexp_count(col, X'DEAD')#x,
regexp_count(cast(col#x as string), cast(0xBEEF as string)) AS
regexp_count(col, X'BEEF')#x]
++- Project [len(col#x) AS len(col)#x, regexp_count(hex(col#x), hex(0x42)) AS
regexp_count(hex(col), hex(X'42'))#x, regexp_count(hex(col#x), hex(0xDEAD)) AS
regexp_count(hex(col), hex(X'DEAD'))#x, regexp_count(hex(col#x), hex(0xBEEF))
AS regexp_count(hex(col), hex(X'BEEF'))#x]
+- SubqueryAlias t
+- CTERelationRef xxxx, true, [col#x], false, false, 1
@@ -335,6 +335,156 @@ WithCTE
+- CTERelationRef xxxx, true, [col1#x, col2#x], false, false, 1
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (2), (3) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as bigint)), (cast(2 as bigint)), (cast(2 as bigint)), (cast(3 as
bigint)) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#xL as string), ,, col#xL ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#xL]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as smallint)), (cast(2 as smallint)), (cast(2 as smallint)), (cast(3 as
smallint)) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as tinyint)), (cast(2 as tinyint)), (cast(2 as tinyint)), (cast(3 as
tinyint)) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.10 as decimal(10,2))), (cast(2.20 as decimal(10,2))), (cast(2.20 as
decimal(10,2))) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(DATE'2024-01-01'), (DATE'2024-01-02'), (DATE'2024-01-01') AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP_NTZ'2024-01-01 10:00:00'), (TIMESTAMP_NTZ'2024-01-02 12:00:00'),
(TIMESTAMP_NTZ'2024-01-01 10:00:00') AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(true), (false), (true) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(INTERVAL '1' MONTH), (INTERVAL '2' MONTH), (INTERVAL '1' MONTH) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT cast(col as string), ',') WITHIN GROUP (ORDER BY col)
FROM VALUES (10), (1), (2), (20), (2) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT CAST(col AS STRING), ,) WITHIN GROUP (ORDER BY col
ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col DESC) FROM VALUES
(1), (10), (2), (20), (2) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x DESC NULLS LAST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col DESC NULLS LAST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (null), (2), (3) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col NULLS FIRST) FROM
VALUES (1), (null), (2), (null) AS t(col)
+-- !query analysis
+Aggregate [listagg(distinct cast(col#x as string), ,, col#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [col#x]
+
+
+-- !query
+SELECT grp, listagg(DISTINCT col) WITHIN GROUP (ORDER BY col) FROM VALUES (1,
'a'), (1, 'b'), (2, 'a'), (2, 'a'), (1, 'b') AS t(grp, col) GROUP BY grp
+-- !query analysis
+Aggregate [grp#x], [grp#x, listagg(distinct col#x, null, col#x ASC NULLS
FIRST, 0, 0) AS listagg(DISTINCT col, NULL) WITHIN GROUP (ORDER BY col ASC
NULLS FIRST)#x]
++- SubqueryAlias t
+ +- LocalRelation [grp#x, col#x]
+
+
+-- !query
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'2C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'DEAD'), (X'BEEF'), (X'DEAD'), (X'CAFE'))) SELECT
len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')), regexp_count(hex(col), hex(X'CAFE')) FROM t
+-- !query analysis
+WithCTE
+:- CTERelationDef xxxx, false
+: +- SubqueryAlias t
+: +- Project [listagg(DISTINCT col1, X'2C') WITHIN GROUP (ORDER BY col1
ASC NULLS FIRST)#x AS col#x]
+: +- Aggregate [listagg(distinct col1#x, 0x2C, col1#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col1, X'2C') WITHIN GROUP (ORDER BY col1 ASC NULLS
FIRST)#x]
+: +- SubqueryAlias __auto_generated_subquery_name
+: +- LocalRelation [col1#x]
++- Project [len(col#x) AS len(col)#x, regexp_count(hex(col#x), hex(0xDEAD)) AS
regexp_count(hex(col), hex(X'DEAD'))#x, regexp_count(hex(col#x), hex(0xBEEF))
AS regexp_count(hex(col), hex(X'BEEF'))#x, regexp_count(hex(col#x),
hex(0xCAFE)) AS regexp_count(hex(col), hex(X'CAFE'))#x]
+ +- SubqueryAlias t
+ +- CTERelationRef xxxx, true, [col#x], false, false, 1
+
+
+-- !query
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'7C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'BB'), (X'AA'), (NULL), (X'BB'))) SELECT len(col),
regexp_count(hex(col), hex(X'AA')), regexp_count(hex(col), hex(X'BB')) FROM t
+-- !query analysis
+WithCTE
+:- CTERelationDef xxxx, false
+: +- SubqueryAlias t
+: +- Project [listagg(DISTINCT col1, X'7C') WITHIN GROUP (ORDER BY col1
ASC NULLS FIRST)#x AS col#x]
+: +- Aggregate [listagg(distinct col1#x, 0x7C, col1#x ASC NULLS FIRST,
0, 0) AS listagg(DISTINCT col1, X'7C') WITHIN GROUP (ORDER BY col1 ASC NULLS
FIRST)#x]
+: +- SubqueryAlias __auto_generated_subquery_name
+: +- LocalRelation [col1#x]
++- Project [len(col#x) AS len(col)#x, regexp_count(hex(col#x), hex(0xAA)) AS
regexp_count(hex(col), hex(X'AA'))#x, regexp_count(hex(col#x), hex(0xBB)) AS
regexp_count(hex(col), hex(X'BB'))#x]
+ +- SubqueryAlias t
+ +- CTERelationRef xxxx, true, [col#x], false, false, 1
+
+
+-- !query
+SELECT grp, hex(listagg(DISTINCT col, X'2C') WITHIN GROUP (ORDER BY col)) FROM
VALUES (1, X'AA'), (1, X'BB'), (1, X'AA'), (2, X'CC'), (2, X'CC') AS t(grp,
col) GROUP BY grp
+-- !query analysis
+Aggregate [grp#x], [grp#x, hex(listagg(distinct col#x, 0x2C, col#x ASC NULLS
FIRST, 0, 0)) AS hex(listagg(DISTINCT col, X'2C') WITHIN GROUP (ORDER BY col
ASC NULLS FIRST))#x]
++- SubqueryAlias t
+ +- LocalRelation [grp#x, col#x]
+
+
-- !query
SELECT listagg(c1) FROM (VALUES (ARRAY('a', 'b'))) AS t(c1)
-- !query analysis
@@ -507,3 +657,48 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
"orderingExpr" : "\"col1\", \"col2\""
}
}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.1 as double)), (cast(2.2 as double)), (cast(2.2 as double)), (cast(3.3
as double)) AS t(col)
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"DOUBLE\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.0 as float)), (cast(2.0 as float)), (cast(2.0 as float)) AS t(col)
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"FLOAT\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP'2024-01-01 10:00:00'), (TIMESTAMP'2024-01-02 12:00:00'),
(TIMESTAMP'2024-01-01 10:00:00') AS t(col)
+-- !query analysis
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"TIMESTAMP\""
+ }
+}
diff --git
a/sql/core/src/test/resources/sql-tests/inputs/listagg-collations.sql
b/sql/core/src/test/resources/sql-tests/inputs/listagg-collations.sql
index aa3d02dc2fb6..86466711f93a 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/listagg-collations.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/listagg-collations.sql
@@ -12,3 +12,19 @@ WITH t(c1) AS (SELECT listagg(col1) WITHIN GROUP (ORDER BY
col1 COLLATE unicode_
-- Error case with collations
SELECT listagg(DISTINCT c1 COLLATE utf8_lcase) WITHIN GROUP (ORDER BY c1
COLLATE utf8_binary) FROM (VALUES ('a'), ('b'), ('A'), ('B')) AS t(c1);
+
+-- LISTAGG DISTINCT cast safety with collations:
+-- string -> string (safe): explicit cast to same collation
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col);
+-- string -> string (unsafe): cast to non-binary-equality collation on target
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES ('ABC'), ('abc'), ('ABC') AS t(col);
+
+-- binary -> string (safe): cast to default STRING (UTF8_BINARY)
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES (X'414243'), (X'616263'), (X'414243') AS t(col); -- ABC, abc, ABC
+-- binary -> string (unsafe): cast to non-binary-equality collation on target
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES (X'414243'), (X'616263'), (X'414243') AS t(col); --
ABC, abc, ABC
+
+-- string -> binary (safe): UTF8_BINARY source, BinaryType target
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col);
+-- string -> binary (unsafe): non-binary-equality source (UTF8_LCASE),
BinaryType target
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
(SELECT col COLLATE UTF8_LCASE AS col FROM VALUES ('ABC'), ('abc'), ('ABC') AS
t(col));
diff --git a/sql/core/src/test/resources/sql-tests/inputs/listagg.sql
b/sql/core/src/test/resources/sql-tests/inputs/listagg.sql
index 021de9975d30..3a7401893aaa 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/listagg.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/listagg.sql
@@ -24,11 +24,30 @@ WITH t(col) AS (SELECT listagg(col1, '|') WITHIN GROUP
(ORDER BY col2 DESC) FROM
SELECT listagg(col1, '|') WITHIN GROUP (ORDER BY col2 DESC) FROM df;
SELECT listagg(col1) WITHIN GROUP (ORDER BY col2 DESC, col1 ASC) FROM df;
SELECT listagg(col1) WITHIN GROUP (ORDER BY col2 DESC, col1 DESC) FROM df;
-WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t;
-WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t;
-WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(col, X'42'), regexp_count(col,
X'DEAD'), regexp_count(col, X'BEEF') FROM t;
+WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t;
+WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t;
+WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(hex(col), hex(X'42')),
regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col), hex(X'BEEF')) FROM
t;
WITH t(col1, col2) AS (SELECT listagg(col1), listagg(col2, ',') FROM df2)
SELECT len(col1), regexp_count(col1, '1'), regexp_count(col1, '2'),
regexp_count(col1, '3'), len(col2), regexp_count(col2, 'true'),
regexp_count(col1, 'false') FROM t;
+-- LISTAGG with DISTINCT with implicit cast from non-string types (safe types
- should succeed)
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (2), (3) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as bigint)), (cast(2 as bigint)), (cast(2 as bigint)), (cast(3 as
bigint)) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as smallint)), (cast(2 as smallint)), (cast(2 as smallint)), (cast(3 as
smallint)) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as tinyint)), (cast(2 as tinyint)), (cast(2 as tinyint)), (cast(3 as
tinyint)) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.10 as decimal(10,2))), (cast(2.20 as decimal(10,2))), (cast(2.20 as
decimal(10,2))) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(DATE'2024-01-01'), (DATE'2024-01-02'), (DATE'2024-01-01') AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP_NTZ'2024-01-01 10:00:00'), (TIMESTAMP_NTZ'2024-01-02 12:00:00'),
(TIMESTAMP_NTZ'2024-01-01 10:00:00') AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(true), (false), (true) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(INTERVAL '1' MONTH), (INTERVAL '2' MONTH), (INTERVAL '1' MONTH) AS t(col);
+SELECT listagg(DISTINCT cast(col as string), ',') WITHIN GROUP (ORDER BY col)
FROM VALUES (10), (1), (2), (20), (2) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col DESC) FROM VALUES
(1), (10), (2), (20), (2) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (null), (2), (3) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col NULLS FIRST) FROM
VALUES (1), (null), (2), (null) AS t(col);
+SELECT grp, listagg(DISTINCT col) WITHIN GROUP (ORDER BY col) FROM VALUES (1,
'a'), (1, 'b'), (2, 'a'), (2, 'a'), (1, 'b') AS t(grp, col) GROUP BY grp;
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'2C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'DEAD'), (X'BEEF'), (X'DEAD'), (X'CAFE'))) SELECT
len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')), regexp_count(hex(col), hex(X'CAFE')) FROM t;
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'7C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'BB'), (X'AA'), (NULL), (X'BB'))) SELECT len(col),
regexp_count(hex(col), hex(X'AA')), regexp_count(hex(col), hex(X'BB')) FROM t;
+SELECT grp, hex(listagg(DISTINCT col, X'2C') WITHIN GROUP (ORDER BY col)) FROM
VALUES (1, X'AA'), (1, X'BB'), (1, X'AA'), (2, X'CC'), (2, X'CC') AS t(grp,
col) GROUP BY grp;
+
-- Error cases
SELECT listagg(c1) FROM (VALUES (ARRAY('a', 'b'))) AS t(c1);
SELECT listagg(c1, ', ') FROM (VALUES (X'DEAD'), (X'BEEF')) AS t(c1);
@@ -39,3 +58,6 @@ SELECT string_agg(col1) WITHIN GROUP (ORDER BY col1) OVER
(ORDER BY col1) FROM d
SELECT listagg(DISTINCT col1) OVER (ORDER BY col1) FROM df;
SELECT listagg(DISTINCT col1) WITHIN GROUP (ORDER BY col2) FROM df;
SELECT listagg(DISTINCT col1) WITHIN GROUP (ORDER BY col1, col2) FROM df;
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.1 as double)), (cast(2.2 as double)), (cast(2.2 as double)), (cast(3.3
as double)) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.0 as float)), (cast(2.0 as float)), (cast(2.0 as float)) AS t(col);
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP'2024-01-01 10:00:00'), (TIMESTAMP'2024-01-02 12:00:00'),
(TIMESTAMP'2024-01-01 10:00:00') AS t(col);
diff --git
a/sql/core/src/test/resources/sql-tests/results/listagg-collations.sql.out
b/sql/core/src/test/resources/sql-tests/results/listagg-collations.sql.out
index 1f8c5822e7d8..6e826a3ed9be 100644
--- a/sql/core/src/test/resources/sql-tests/results/listagg-collations.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/listagg-collations.sql.out
@@ -80,3 +80,78 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
"orderingExpr" : "\"collate(c1, utf8_binary)\""
}
}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query schema
+struct<listagg(DISTINCT CAST(col AS STRING), NULL) WITHIN GROUP (ORDER BY col
ASC NULLS FIRST):string>
+-- !query output
+ABCabc
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING COLLATE UTF8_LCASE\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"STRING\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col) FROM
VALUES (X'414243'), (X'616263'), (X'414243') AS t(col)
+-- !query schema
+struct<listagg(DISTINCT CAST(col AS STRING), NULL) WITHIN GROUP (ORDER BY col
ASC NULLS FIRST):string>
+-- !query output
+ABCabc
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS STRING COLLATE UTF8_LCASE)) WITHIN GROUP
(ORDER BY col) FROM VALUES (X'414243'), (X'616263'), (X'414243') AS t(col)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING COLLATE UTF8_LCASE\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"BINARY\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
VALUES ('ABC'), ('abc'), ('ABC') AS t(col)
+-- !query schema
+struct<listagg(DISTINCT CAST(col AS BINARY), NULL) WITHIN GROUP (ORDER BY col
ASC NULLS FIRST):binary>
+-- !query output
+ABCabc
+
+
+-- !query
+SELECT listagg(DISTINCT CAST(col AS BINARY)) WITHIN GROUP (ORDER BY col) FROM
(SELECT col COLLATE UTF8_LCASE AS col FROM VALUES ('ABC'), ('abc'), ('ABC') AS
t(col))
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"BINARY\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"STRING COLLATE UTF8_LCASE\""
+ }
+}
diff --git a/sql/core/src/test/resources/sql-tests/results/listagg.sql.out
b/sql/core/src/test/resources/sql-tests/results/listagg.sql.out
index 22ffceedb627..ca387fe70fa1 100644
--- a/sql/core/src/test/resources/sql-tests/results/listagg.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/listagg.sql.out
@@ -158,25 +158,25 @@ bbaa
-- !query
-WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t
-- !query schema
-struct<len(col):int,regexp_count(col, X'DEAD'):int,regexp_count(col,
X'BEEF'):int>
+struct<len(col):int,regexp_count(hex(col),
hex(X'DEAD')):int,regexp_count(hex(col), hex(X'BEEF')):int>
-- !query output
4 1 1
-- !query
-WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(col, X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1, NULL) FROM (VALUES (X'DEAD'), (X'BEEF')))
SELECT len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')) FROM t
-- !query schema
-struct<len(col):int,regexp_count(col, X'DEAD'):int,regexp_count(col,
X'BEEF'):int>
+struct<len(col):int,regexp_count(hex(col),
hex(X'DEAD')):int,regexp_count(hex(col), hex(X'BEEF')):int>
-- !query output
4 1 1
-- !query
-WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(col, X'42'), regexp_count(col,
X'DEAD'), regexp_count(col, X'BEEF') FROM t
+WITH t(col) AS (SELECT listagg(col1, X'42') FROM (VALUES (X'DEAD'),
(X'BEEF'))) SELECT len(col), regexp_count(hex(col), hex(X'42')),
regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col), hex(X'BEEF')) FROM
t
-- !query schema
-struct<len(col):int,regexp_count(col, X'42'):int,regexp_count(col,
X'DEAD'):int,regexp_count(col, X'BEEF'):int>
+struct<len(col):int,regexp_count(hex(col),
hex(X'42')):int,regexp_count(hex(col), hex(X'DEAD')):int,regexp_count(hex(col),
hex(X'BEEF')):int>
-- !query output
5 1 1 1
@@ -189,6 +189,144 @@ struct<len(col1):int,regexp_count(col1,
1):int,regexp_count(col1, 2):int,regexp_
3 1 1 1 16 1 0
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (2), (3) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2,3
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as bigint)), (cast(2 as bigint)), (cast(2 as bigint)), (cast(3 as
bigint)) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2,3
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as smallint)), (cast(2 as smallint)), (cast(2 as smallint)), (cast(3 as
smallint)) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2,3
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1 as tinyint)), (cast(2 as tinyint)), (cast(2 as tinyint)), (cast(3 as
tinyint)) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2,3
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.10 as decimal(10,2))), (cast(2.20 as decimal(10,2))), (cast(2.20 as
decimal(10,2))) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1.10,2.20
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(DATE'2024-01-01'), (DATE'2024-01-02'), (DATE'2024-01-01') AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+2024-01-01,2024-01-02
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP_NTZ'2024-01-01 10:00:00'), (TIMESTAMP_NTZ'2024-01-02 12:00:00'),
(TIMESTAMP_NTZ'2024-01-01 10:00:00') AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+2024-01-01 10:00:00,2024-01-02 12:00:00
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(true), (false), (true) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+false,true
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(INTERVAL '1' MONTH), (INTERVAL '2' MONTH), (INTERVAL '1' MONTH) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+INTERVAL '1' MONTH,INTERVAL '2' MONTH
+
+
+-- !query
+SELECT listagg(DISTINCT cast(col as string), ',') WITHIN GROUP (ORDER BY col)
FROM VALUES (10), (1), (2), (20), (2) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT CAST(col AS STRING), ,) WITHIN GROUP (ORDER BY col ASC
NULLS FIRST):string>
+-- !query output
+1,2,10,20
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col DESC) FROM VALUES
(1), (10), (2), (20), (2) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col DESC NULLS
LAST):string>
+-- !query output
+20,10,2,1
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES (1),
(2), (null), (2), (3) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2,3
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col NULLS FIRST) FROM
VALUES (1), (null), (2), (null) AS t(col)
+-- !query schema
+struct<listagg(DISTINCT col, ,) WITHIN GROUP (ORDER BY col ASC NULLS
FIRST):string>
+-- !query output
+1,2
+
+
+-- !query
+SELECT grp, listagg(DISTINCT col) WITHIN GROUP (ORDER BY col) FROM VALUES (1,
'a'), (1, 'b'), (2, 'a'), (2, 'a'), (1, 'b') AS t(grp, col) GROUP BY grp
+-- !query schema
+struct<grp:int,listagg(DISTINCT col, NULL) WITHIN GROUP (ORDER BY col ASC
NULLS FIRST):string>
+-- !query output
+1 ab
+2 a
+
+
+-- !query
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'2C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'DEAD'), (X'BEEF'), (X'DEAD'), (X'CAFE'))) SELECT
len(col), regexp_count(hex(col), hex(X'DEAD')), regexp_count(hex(col),
hex(X'BEEF')), regexp_count(hex(col), hex(X'CAFE')) FROM t
+-- !query schema
+struct<len(col):int,regexp_count(hex(col),
hex(X'DEAD')):int,regexp_count(hex(col),
hex(X'BEEF')):int,regexp_count(hex(col), hex(X'CAFE')):int>
+-- !query output
+8 1 1 1
+
+
+-- !query
+WITH t(col) AS (SELECT listagg(DISTINCT col1, X'7C') WITHIN GROUP (ORDER BY
col1) FROM (VALUES (X'BB'), (X'AA'), (NULL), (X'BB'))) SELECT len(col),
regexp_count(hex(col), hex(X'AA')), regexp_count(hex(col), hex(X'BB')) FROM t
+-- !query schema
+struct<len(col):int,regexp_count(hex(col),
hex(X'AA')):int,regexp_count(hex(col), hex(X'BB')):int>
+-- !query output
+3 1 1
+
+
+-- !query
+SELECT grp, hex(listagg(DISTINCT col, X'2C') WITHIN GROUP (ORDER BY col)) FROM
VALUES (1, X'AA'), (1, X'BB'), (1, X'AA'), (2, X'CC'), (2, X'CC') AS t(grp,
col) GROUP BY grp
+-- !query schema
+struct<grp:int,hex(listagg(DISTINCT col, X'2C') WITHIN GROUP (ORDER BY col ASC
NULLS FIRST)):string>
+-- !query output
+1 AA2CBB
+2 CC
+
+
-- !query
SELECT listagg(c1) FROM (VALUES (ARRAY('a', 'b'))) AS t(c1)
-- !query schema
@@ -374,3 +512,54 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
"orderingExpr" : "\"col1\", \"col2\""
}
}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.1 as double)), (cast(2.2 as double)), (cast(2.2 as double)), (cast(3.3
as double)) AS t(col)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"DOUBLE\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(cast(1.0 as float)), (cast(2.0 as float)), (cast(2.0 as float)) AS t(col)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"FLOAT\""
+ }
+}
+
+
+-- !query
+SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES
(TIMESTAMP'2024-01-01 10:00:00'), (TIMESTAMP'2024-01-02 12:00:00'),
(TIMESTAMP'2024-01-01 10:00:00') AS t(col)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.ExtendedAnalysisException
+{
+ "errorClass" :
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT_UNSAFE_CAST",
+ "sqlState" : "42K0K",
+ "messageParameters" : {
+ "castType" : "\"STRING\"",
+ "funcName" : "`listagg`",
+ "inputType" : "\"TIMESTAMP\""
+ }
+}
diff --git
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
index f606c6746f3c..355333d8268f 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
@@ -709,6 +709,31 @@ class DataFrameAggregateSuite extends QueryTest
)
}
+ test("SPARK-55501: listagg with DISTINCT and ORDER BY") {
+ val df = Seq(1, 2, 10, 1, 9).toDF("a")
+
+ withSQLConf(SQLConf.LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER.key -> "true") {
+ checkAnswer(
+ df.selectExpr("listagg(distinct a, ', ') within group (order by a)"),
+ Seq(Row("1, 2, 9, 10"))
+ )
+ }
+
+ withSQLConf(SQLConf.LISTAGG_ALLOW_DISTINCT_CAST_WITH_ORDER.key -> "false")
{
+ checkError(
+ exception = intercept[AnalysisException] {
+ df.selectExpr("listagg(distinct a) within group (order by a)")
+ },
+ condition =
"INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT",
+ parameters = Map(
+ "funcName" -> "`listagg`",
+ "funcArg" -> "\"a\"",
+ "orderingExpr" -> "\"a\""
+ )
+ )
+ }
+ }
+
test("SPARK-31500: collect_set() of BinaryType returns duplicate elements") {
val bytesTest1 = "test1".getBytes
val bytesTest2 = "test2".getBytes