spark git commit: [SPARK-25505][SQL] The output order of grouping columns in Pivot is different from the input order

lixiao Fri, 28 Sep 2018 00:09:15 -0700

Repository: spark
Updated Branches:
  refs/heads/master 3b7395fe0 -> e120a38c0



[SPARK-25505][SQL] The output order of grouping columns in Pivot is different 
from the input order

## What changes were proposed in this pull request?

The grouping columns from a Pivot query are inferred as "input columns - pivot 
columns - pivot aggregate columns", where input columns are the output of the 
child relation of Pivot. The grouping columns will be the leading columns in 
the pivot output and they should preserve the same order as specified by the 
input. For example,
```
SELECT * FROM (
  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
"x" as x, "d" as d, "w" as w
  FROM courseSales
)
PIVOT (
  sum(earnings)
  FOR course IN ('dotNET', 'Java')
)
```
The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, b, 
c, d, w, x, y, z, ..."

The fix is to use the child plan's `output` instead of `outputSet` so that the 
order can be preserved.

## How was this patch tested?

Added UT.

Closes #22519 from maryannxue/spark-25505.

Authored-by: maryannxue <[email protected]>
Signed-off-by: gatorsmile <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e120a38c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e120a38c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e120a38c

Branch: refs/heads/master
Commit: e120a38c0cdfb569c9151bef4d53e98175da2b25
Parents: 3b7395f
Author: maryannxue <[email protected]>
Authored: Fri Sep 28 00:09:06 2018 -0700
Committer: gatorsmile <[email protected]>
Committed: Fri Sep 28 00:09:06 2018 -0700

----------------------------------------------------------------------
 .../spark/sql/catalyst/analysis/Analyzer.scala     |  7 +++++--
 .../src/test/resources/sql-tests/inputs/pivot.sql  | 10 ++++++++++
 .../test/resources/sql-tests/results/pivot.sql.out | 17 ++++++++++++++++-
 3 files changed, 31 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/e120a38c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
----------------------------------------------------------------------
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 7034dfd..c0a7308 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -554,8 +554,11 @@ class Analyzer(
           Cast(value, pivotColumn.dataType, 
Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
         }
         // Group-by expressions coming from SQL are implicit and need to be 
deduced.
-        val groupByExprs = groupByExprsOpt.getOrElse(
-          (child.outputSet -- aggregates.flatMap(_.references) -- 
pivotColumn.references).toSeq)
+        val groupByExprs = groupByExprsOpt.getOrElse {
+          val pivotColAndAggRefs =
+            (pivotColumn.references ++ aggregates.flatMap(_.references)).toSet
+          child.output.filterNot(pivotColAndAggRefs.contains)
+        }
         val singleAgg = aggregates.size == 1
         def outputName(value: Expression, aggregate: Expression): String = {
           val stringValue = value match {

http://git-wip-us.apache.org/repos/asf/spark/blob/e120a38c/sql/core/src/test/resources/sql-tests/inputs/pivot.sql
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/sql-tests/inputs/pivot.sql 
b/sql/core/src/test/resources/sql-tests/inputs/pivot.sql
index 1f607b3..81547ab 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/pivot.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/pivot.sql
@@ -287,3 +287,13 @@ PIVOT (
   sum(earnings)
   FOR (course, m) IN (('dotNET', map('1', 1)), ('Java', map('2', 2)))
 );
+
+-- grouping columns output in the same order as input
+SELECT * FROM (
+  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
"x" as x, "d" as d, "w" as w
+  FROM courseSales
+)
+PIVOT (
+  sum(earnings)
+  FOR course IN ('dotNET', 'Java')
+);

http://git-wip-us.apache.org/repos/asf/spark/blob/e120a38c/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out 
b/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
index 2dd9293..487883a 100644
--- a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 31
+-- Number of queries: 32
 
 
 -- !query 0
@@ -476,3 +476,18 @@ struct<>
 -- !query 30 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'named_struct(course, course#x, m, m#x)'. Pivot columns 
must be comparable.;
+
+
+-- !query 31
+SELECT * FROM (
+  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
"x" as x, "d" as d, "w" as w
+  FROM courseSales
+)
+PIVOT (
+  sum(earnings)
+  FOR course IN ('dotNET', 'Java')
+)
+-- !query 31 schema
+struct<a:string,z:string,b:string,y:string,c:string,x:string,d:string,w:string,dotNET:bigint,Java:bigint>
+-- !query 31 output
+a      z       b       y       c       x       d       w       63000   50000


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

spark git commit: [SPARK-25505][SQL] The output order of grouping columns in Pivot is different from the input order

Reply via email to