This is an automated email from the ASF dual-hosted git repository.
beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c7bba9bfcc3 [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`
c7bba9bfcc3 is described below
commit c7bba9bfcc350bd3508dd6bb41da6f0c1fef63c6
Author: Yuming Wang <[email protected]>
AuthorDate: Wed Nov 1 19:24:57 2023 +0800
[SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`
### What changes were proposed in this pull request?
This PR makes `Dataset.isEmpty()` apply a global limit of 1 before the query
is executed. `LimitPushDown` may then push the limit down to lower nodes to
improve query performance.
Note that a global limit of 1 is used here because a local limit alone cannot
be pushed down in the group-only aggregate case:
https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770
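As a rough illustration of the effect described above, the sketch below compares the optimized plan of the empty projection that `isEmpty` used before this change with the plan obtained once `limit(1)` is added. The object name `IsEmptyPlanSketch`, the helper `plans`, and the union-of-ranges dataset are hypothetical stand-ins and are not part of this commit; the exact plan text depends on the Spark version, although one would expect the second plan to carry limit nodes that the optimizer can push toward the union branches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper object, for illustration only (not part of the commit).
object IsEmptyPlanSketch {
  // Returns the optimized plans of the empty projection without and with limit(1).
  def plans(spark: SparkSession): (LogicalPlan, LogicalPlan) = {
    // Assumed stand-in dataset: a union, so each branch is a candidate for limit pushdown.
    val ds = spark.range(1000).union(spark.range(1000))
    // Plan shape used by isEmpty before this change: empty projection, no limit.
    val before = ds.select().queryExecution.optimizedPlan
    // Plan shape used after this change: empty projection followed by a global limit of 1.
    val after = ds.select().limit(1).queryExecution.optimizedPlan
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; any existing SparkSession would work as well.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("isEmpty-plan-sketch")
      .getOrCreate()
    val (before, after) = plans(spark)
    println(before.treeString)
    println(after.treeString)
    spark.stop()
  }
}
```

Reading the two printed plans side by side is usually enough to see whether the limit reaches the lower nodes for a given query shape.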
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual testing:
```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```
Before this PR | After this PR
-- | --
<img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">
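For readers who cannot view the screenshots, a wall-clock comparison can be approximated with a small timing helper such as the sketch below. The object `IsEmptyTimingSketch`, the helper `timedIsEmpty`, and the temporary view `small_t` are hypothetical and were not used in the manual test above; the view is deliberately tiny so the sketch runs without the `t1` and `t2` tables, and meaningful timings would need inputs of the size shown in the manual test, run on builds with and without this change.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only (not part of the commit).
object IsEmptyTimingSketch {
  // Runs isEmpty on the given SQL query and returns (result, elapsed milliseconds).
  def timedIsEmpty(spark: SparkSession, query: String): (Boolean, Long) = {
    val start = System.nanoTime()
    val empty = spark.sql(query).isEmpty
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (empty, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("isEmpty-timing-sketch")
      .getOrCreate()
    // Small assumed input so the sketch is self-contained.
    spark.range(1000)
      .selectExpr("id", "array(id, id % 10) AS eo")
      .createOrReplaceTempView("small_t")
    val (empty, ms) = timedIsEmpty(
      spark,
      "SELECT * FROM small_t LATERAL VIEW explode_outer(eo) AS e")
    println(s"isEmpty=$empty took $ms ms")
    spark.stop()
  }
}
```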
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43617 from wangyum/SPARK-45755.
Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
---
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index ba5eb790cea..a567a915daf 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -652,7 +652,7 @@ class Dataset[T] private[sql](
* @group basic
* @since 2.4.0
*/
- def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
+ def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
plan.executeTake(1).isEmpty
}