This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new ede8c52de88 [SPARK-41790][SQL] Set TRANSFORM reader and writer's
format correctly
ede8c52de88 is described below
commit ede8c52de8878cbcd098284d5c632ea8fa4ebf67
Author: maming <[email protected]>
AuthorDate: Thu Jan 5 17:52:19 2023 +0900
[SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly
### What changes were proposed in this pull request?
We get wrong data when a TRANSFORM query specifies ROW FORMAT DELIMITED for
only the reader or only the writer: the wrong format is currently used to
feed data to, and fetch data from, the running script. We should set the
formats correctly.
Currently in Spark:
```sql
spark-sql> CREATE TABLE t1 (a string, b string);
spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
spark-sql> SELECT TRANSFORM(a, b)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> USING 'cat'
> AS (c)
> FROM t1;
c
spark-sql> SELECT TRANSFORM(a, b)
> USING 'cat'
> AS (c)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> FROM t1;
c
1 23 4
```
The same sql in hive:
```sql
hive> SELECT TRANSFORM(a, b)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> USING 'cat'
> AS (c)
> FROM t1;
c
1,2
3,4
hive> SELECT TRANSFORM(a, b)
> USING 'cat'
> AS (c)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> FROM t1;
c
1 2
3 4
```
### Why are the changes needed?
Fix transform writer format and reader format.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, when a TRANSFORM query specified ROW FORMAT DELIMITED for
only the reader or only the writer, the query could return wrong data; after
this change the correct format is applied and the results match Hive.
### How was this patch tested?
New tests.
Closes #39315 from mattshma/SPARK-41790.
Lead-authored-by: maming <[email protected]>
Co-authored-by: mattshma <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.../apache/spark/sql/execution/SparkSqlParser.scala | 15 +++++++++------
.../execution/HiveScriptTransformationSuite.scala | 20 ++++++++++++++++++++
2 files changed, 29 insertions(+), 6 deletions(-)
diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
index ad0599775de..e67ffa606ef 100644
---
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
+++
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
@@ -751,15 +751,18 @@ class SparkSqlAstBuilder extends AstBuilder {
(Nil, Option(name), props, recordHandler)
}
- val (inFormat, inSerdeClass, inSerdeProps, reader) =
+ // The Writer uses inFormat to feed input data into the running script
and
+ // the reader uses outFormat to read the output from the running script,
+ // this behavior is same with hive.
+ val (inFormat, inSerdeClass, inSerdeProps, writer) =
format(
- inRowFormat, "hive.script.recordreader",
- "org.apache.hadoop.hive.ql.exec.TextRecordReader")
+ inRowFormat, "hive.script.recordwriter",
+ "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
- val (outFormat, outSerdeClass, outSerdeProps, writer) =
+ val (outFormat, outSerdeClass, outSerdeProps, reader) =
format(
- outRowFormat, "hive.script.recordwriter",
- "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
+ outRowFormat, "hive.script.recordreader",
+ "org.apache.hadoop.hive.ql.exec.TextRecordReader")
ScriptInputOutputSchema(
inFormat, outFormat,
diff --git
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
index ad4a311528a..4e8a62acddd 100644
---
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
+++
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
@@ -638,4 +638,24 @@ class HiveScriptTransformationSuite extends
BaseScriptTransformationSuite with T
Row("1") :: Row("2") :: Row("3") :: Nil)
}
}
+
+ test("SPARK-41790: Set TRANSFORM reader and writer's format correctly") {
+ withTempView("v") {
+ val df = Seq(
+ (1, 2)
+ ).toDF("a", "b")
+ df.createTempView("v")
+
+ checkAnswer(
+ sql(
+ s"""
+ |SELECT TRANSFORM(a, b)
+ | ROW FORMAT DELIMITED
+ | FIELDS TERMINATED BY ','
+ | USING 'cat'
+ | AS (c)
+ |FROM v
+ """.stripMargin), identity, Row("1,2") :: Nil)
+ }
+ }
}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]