(spark) branch master updated: [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`

beliefer Wed, 01 Nov 2023 04:25:21 -0700

This is an automated email from the ASF dual-hosted git repository.

beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new c7bba9bfcc3 [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`
c7bba9bfcc3 is described below

commit c7bba9bfcc350bd3508dd6bb41da6f0c1fef63c6
Author: Yuming Wang <[email protected]>
AuthorDate: Wed Nov 1 19:24:57 2023 +0800

    [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`
    
    ### What changes were proposed in this pull request?
    
    This PR makes `Dataset.isEmpty()` apply a global limit of 1 before executing the query. `LimitPushDown` may then push the global limit of 1 down to lower nodes to improve query performance.
    
    Note that a global limit of 1 is used here because a local limit cannot be pushed down in the group-only case:
    https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770
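    
    As an illustration (not part of the original patch), the effect of the extra `limit(1)` can be inspected by printing the query plans of the same expression that `isEmpty` now builds. This is a minimal sketch: the local session settings, the application name, the object name `IsEmptyLimitSketch`, and the `range(10)` input are assumptions chosen for the example.
    
    ```scala
    import org.apache.spark.sql.SparkSession

    object IsEmptyLimitSketch {
      def main(args: Array[String]): Unit = {
        // Assumed local session for illustration; in spark-shell, `spark` already exists.
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("isEmpty-limit-sketch")
          .getOrCreate()

        // Assumed sample data: ten rows with a single `id` column.
        val df = spark.range(10).toDF("id")

        // Same shape as the patched expression: project no columns, then limit to 1 row.
        val limited = df.select().limit(1)

        // Print the analyzed and optimized plans to see where the limit nodes end up.
        println(limited.queryExecution.analyzed)
        println(limited.queryExecution.optimizedPlan)

        // The public behavior of isEmpty is unchanged by the patch.
        println(df.isEmpty)
        println(spark.range(0).isEmpty)

        spark.stop()
      }
    }
    ```
    
    Comparing the printed plans for a more complex query (for example one with a union or an aggregate) is one way to check whether the optimizer actually moved the limit closer to the data source.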
    
    ### Why are the changes needed?
    
    Improve query performance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual testing:
    ```scala
    spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
    spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
    println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
    ```
    
    Before this PR | After this PR
    -- | --
    <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #43617 from wangyum/SPARK-45755.
    
    Lead-authored-by: Yuming Wang <[email protected]>
    Co-authored-by: Yuming Wang <[email protected]>
    Signed-off-by: Jiaan Geng <[email protected]>
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index ba5eb790cea..a567a915daf 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -652,7 +652,7 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 2.4.0
    */
-  def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
+  def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
     plan.executeTake(1).isEmpty
   }
 


