This is an automated email from the ASF dual-hosted git repository.
yumwang pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
new a96804b7542 [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
a96804b7542 is described below
commit a96804b7542c1cbac8e274dba2c322b84f47fa2d
Author: zzzzming95 <[email protected]>
AuthorDate: Wed Sep 6 11:38:30 2023 +0800
[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
### What changes were proposed in this pull request?
Because `BinaryArithmetic#dataType` recursively recomputes the data type of each child node on every call, the driver becomes very slow when many columns are processed. For example, the following code:
```scala
import scala.util.Random

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val N = 30
val M = 100
val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
// Generate a new column that adds up the other 30 columns.
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```
On Spark 3.4 the driver takes a few minutes to execute this code, whereas on Spark 3.2 it takes only a few seconds.
Related issue: [SPARK-39316](https://github.com/apache/spark/pull/36698)
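The fix simply caches the recursive computation in a `lazy val`, so the type is resolved once on first access instead of on every call. A minimal, self-contained sketch of that memoization pattern (the `Expr`/`Leaf`/`Add` classes and the widening rule here are illustrative stand-ins, not Spark's actual expression API):

```scala
// Counts how often a data type is actually computed, to show the caching effect.
var evaluations = 0

sealed trait Expr { def dataType: String }

case class Leaf(tpe: String) extends Expr {
  def dataType: String = { evaluations += 1; tpe }
}

case class Add(left: Expr, right: Expr) extends Expr {
  // Cache the recursive result: computed once, on first access,
  // mirroring the `internalDataType` lazy val in the patch.
  private lazy val internalDataType: String = {
    evaluations += 1
    (left.dataType, right.dataType) match {
      case (l, r) if l == r => l
      case _                => "double" // hypothetical widening rule
    }
  }
  def dataType: String = internalDataType
}

val tree = Add(Add(Leaf("int"), Leaf("int")), Leaf("int"))
tree.dataType            // first call walks the whole tree
val before = evaluations
tree.dataType            // subsequent calls hit the cache
assert(evaluations == before)
```

Without the `lazy val`, each call to `dataType` would re-walk the subtree, which is what made wide expressions like the 30-column sum above so slow.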
### Why are the changes needed?
It optimizes the processing speed of `BinaryArithmetic#dataType` when processing multi-column data.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual testing.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #42804 from zzzzming95/SPARK-45071.
Authored-by: zzzzming95 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 16e813cecd55490a71ef6c05fca2209fbdae078f)
Signed-off-by: Yuming Wang <[email protected]>
---
.../apache/spark/sql/catalyst/expressions/arithmetic.scala | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index 88f7fabf121..1396e4d1259 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -218,6 +218,12 @@ abstract class BinaryArithmetic extends BinaryOperator
protected val evalMode: EvalMode.Value
+  private lazy val internalDataType: DataType = (left.dataType, right.dataType) match {
+ case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+ resultDecimalType(p1, s1, p2, s2)
+ case _ => left.dataType
+ }
+
protected def failOnError: Boolean = evalMode match {
// The TRY mode executes as if it would fail on errors, except that it would capture the errors
// and return null results.
@@ -233,11 +239,7 @@ abstract class BinaryArithmetic extends BinaryOperator
case _ => super.checkInputDataTypes()
}
- override def dataType: DataType = (left.dataType, right.dataType) match {
- case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
- resultDecimalType(p1, s1, p2, s2)
- case _ => left.dataType
- }
+ override def dataType: DataType = internalDataType
// When `spark.sql.decimalOperations.allowPrecisionLoss` is set to true, if the precision / scale
// needed are out of the range of available values, the scale is reduced up to 6, in order to
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]