This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 291f4a3e3db9 [SPARK-51935][SQL] Fix lazy behavior of iterators in interpreted df.collect()
291f4a3e3db9 is described below

commit 291f4a3e3db92dac6a56df5c82b68a922cc49951
Author: Harsh Motwani <harsh.motw...@databricks.com>
AuthorDate: Tue Apr 29 08:44:06 2025 +0800

    [SPARK-51935][SQL] Fix lazy behavior of iterators in interpreted df.collect()
    
    ### What changes were proposed in this pull request?
    
    Prior to this PR, in a specific codepath in the non-SQL expression
    `MapObjects`, `df.collect()` returned a lazily evaluated sequence when
    codegen was disabled, which could lead to non-deterministic results.
    For example, the following code could return incorrect results:
    ```
    spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
    
    import org.apache.spark.sql.Row
    
    case class Posting(transfer_type: String)
    
    val df3 = spark.range(0, 8, 1, 1).selectExpr("""
    case when id = 0 then array(named_struct('a', '1'), named_struct('a', '3'))
    when id = 1 then null
    when id = 2 then null
    when id = 3 then null
    when id = 4 then null
    when id = 5 then null
    when id = 6 then array(named_struct('a', '5'), named_struct('a', '7'))
    else array(named_struct('a', '9'), named_struct('a', '11'))
    end result
    """)
    
    val res6 = df3.select("result").map(row =>
      if (row.get(0) != null) row.get(0).asInstanceOf[Seq[Row]].map(v => Posting(v.get(0).asInstanceOf[String]))
      else null
    ).collect()
    ```
    
    ### Why are the changes needed?
    
    The previous code returned a lazily evaluated sequence, which could give
    incorrect results when a DataFrame holds Scala case classes. This PR makes
    the conversion eager.
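
    The following is a minimal standalone sketch of the hazard, not Spark's
    internals: a lazy view over a reused mutable buffer reads whatever the
    buffer holds at forcing time, while an eager `toIndexedSeq` snapshots the
    values at conversion time (names here are illustrative only):

    ```
    object LazySeqHazard {
      def main(args: Array[String]): Unit = {
        val buffer = Array(1, 2, 3)

        // Lazy: elements are not read from `buffer` until the LazyList is forced.
        val lazySeq: Seq[Int] = buffer.iterator.to(LazyList)
        // Eager: values are copied out of `buffer` immediately.
        val eagerSeq: Seq[Int] = buffer.iterator.toIndexedSeq

        // Simulate the engine reusing the underlying buffer for the next row.
        buffer(0) = 99

        println(lazySeq.head)  // 99 -- the mutation is observed
        println(eagerSeq.head) // 1  -- snapshot taken before the mutation
      }
    }
    ```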
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users could see more deterministic behavior when forcing interpreted mode
    and using Scala objects in DataFrames.
    
    ### How was this patch tested?
    
    Existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #50741 from harshmotw-db/harsh-motwani_data/SPARK-51935.
    
    Authored-by: Harsh Motwani <harsh.motw...@databricks.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/expressions/objects/objects.scala     | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index 90ab8725e553..28e934b5df95 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -992,7 +992,7 @@ case class MapObjects private(
       }
     case Some(cls) if classOf[scala.collection.Seq[_]].isAssignableFrom(cls) =>
       // Scala sequence
-      executeFuncOnCollection(_).toSeq
+      executeFuncOnCollection(_).toIndexedSeq
     case Some(cls) if classOf[scala.collection.Set[_]].isAssignableFrom(cls) =>
       // Scala set
       executeFuncOnCollection(_).toSet


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
