neilconway commented on code in PR #22736:
URL: https://github.com/apache/datafusion/pull/22736#discussion_r3351560421
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -442,15 +474,30 @@ impl DefaultPhysicalPlanner {
// scalar subqueries to joins, so none should reach this point.
// Skip collection in that case to avoid creating a no-op
// `ScalarSubqueryExec` wrapper.
- let all_subqueries = if session_state
+ if !session_state
.config_options()
.optimizer
.enable_physical_uncorrelated_scalar_subquery
{
- Self::collect_scalar_subqueries(logical_plan)
+ return self
+ .create_initial_plan_inner(logical_plan, session_state)
+ .await;
+ }
+
+ let mut all_subqueries =
Self::collect_scalar_subqueries(logical_plan);
+ let freshened = if
all_subqueries.iter().any(Subquery::is_volatile) {
+ Some(Self::freshen_volatile_subqueries(logical_plan)?)
} else {
- Vec::new()
+ None
};
+ let logical_plan = match &freshened {
+ Some(freshened) => {
+ all_subqueries =
Self::collect_scalar_subqueries(freshened);
+ freshened
+ }
+ None => logical_plan,
+ };
Review Comment:
Calling `collect_scalar_subqueries` twice is unfortunate; as is the need to
traverse the whole tree yet again to freshen the volatile SQs. I'm also not
crazy about distinguishing between two volatile-containing subqueries based
purely on the value of an Arc pointer.
Stepping back a bit, whether a SQ is volatile is not is a local property of
the SQ, so we should be able to determine how a subquery should be handled as
part of a single tree traversal. What if we did something like:
* Do a single pass over the plan
* When we encounter a subquery, determine if it is volatile or not
* If it's volatile, give it a fresh index. If non-volatile, look it up
(structural equality) and assign/reuse an index as appropriate
* Write the index back into the logical plan node
That means that two textually equal SQs containing volatile expressions will
compare non-equal via structural equality, so I think we'll get the semantics
we want? It's a bit gross to write a field back into the logical plan but I
can't think of something cleaner at the moment.
##########
datafusion/expr/src/expr.rs:
##########
@@ -2145,10 +2145,25 @@ impl Expr {
/// For example the function call `RANDOM()` is volatile as each call will
/// return a different value.
///
+ /// This also descends into subquery-bearing expressions, so
+ /// `(SELECT random())` is volatile even though the scalar subquery node is
+ /// not itself a volatile function.
+ ///
/// See [`Volatility`] for more information.
pub fn is_volatile(&self) -> bool {
- self.exists(|expr| Ok(expr.is_volatile_node()))
- .expect("exists closure is infallible")
+ self.exists(|expr| {
+ let subquery_is_volatile = match expr {
+ Expr::ScalarSubquery(subquery)
+ | Expr::Exists(Exists { subquery, .. })
+ | Expr::InSubquery(InSubquery { subquery, .. })
+ | Expr::SetComparison(SetComparison { subquery, .. }) => {
+ subquery.is_volatile()
+ }
+ _ => false,
+ };
+ Ok(expr.is_volatile_node() || subquery_is_volatile)
+ })
+ .expect("exists closure is infallible")
}
Review Comment:
```
$ ag is_volatile | wc -l
77
```
We should be careful about changing the semantics of a widely used API like
this. It's not clear to me that the change is _wrong_ (or right), but it has
both behavioral and performance consequences. We should look at all of those
existing call-sites and understand whether recursing into subqueries is the
right behavior or not. And if we are going to change this, we should ensure it
has unit test coverage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]