neilconway commented on code in PR #22736:
URL: https://github.com/apache/datafusion/pull/22736#discussion_r3351560421


##########
datafusion/core/src/physical_planner.rs:
##########
@@ -442,15 +474,30 @@ impl DefaultPhysicalPlanner {
             // scalar subqueries to joins, so none should reach this point.
             // Skip collection in that case to avoid creating a no-op
             // `ScalarSubqueryExec` wrapper.
-            let all_subqueries = if session_state
+            if !session_state
                 .config_options()
                 .optimizer
                 .enable_physical_uncorrelated_scalar_subquery
             {
-                Self::collect_scalar_subqueries(logical_plan)
+                return self
+                    .create_initial_plan_inner(logical_plan, session_state)
+                    .await;
+            }
+
+            let mut all_subqueries = 
Self::collect_scalar_subqueries(logical_plan);
+            let freshened = if 
all_subqueries.iter().any(Subquery::is_volatile) {
+                Some(Self::freshen_volatile_subqueries(logical_plan)?)
             } else {
-                Vec::new()
+                None
             };
+            let logical_plan = match &freshened {
+                Some(freshened) => {
+                    all_subqueries = 
Self::collect_scalar_subqueries(freshened);
+                    freshened
+                }
+                None => logical_plan,
+            };

Review Comment:
   Calling `collect_scalar_subqueries` twice is unfortunate; as is the need to 
traverse the whole tree yet again to freshen the volatile SQs. I'm also not 
crazy about distinguishing between two volatile-containing subqueries based 
purely on the value of an Arc pointer.
   
   Stepping back a bit, whether a SQ is volatile is not is a local property of 
the SQ, so we should be able to determine how a subquery should be handled as 
part of a single tree traversal. What if we did something like:
   
   * Do a single pass over the plan
   * When we encounter a subquery, determine if it is volatile or not
   * If it's volatile, give it a fresh index. If non-volatile, look it up 
(structural equality) and assign/reuse an index as appropriate
   * Write the index back into the logical plan node
   
   That means that two textually equal SQs containing volatile expressions will 
compare non-equal via structural equality, so I think we'll get the semantics 
we want? It's a bit gross to write a field back into the logical plan but I 
can't think of something cleaner at the moment.



##########
datafusion/expr/src/expr.rs:
##########
@@ -2145,10 +2145,25 @@ impl Expr {
     /// For example the function call `RANDOM()` is volatile as each call will
     /// return a different value.
     ///
+    /// This also descends into subquery-bearing expressions, so
+    /// `(SELECT random())` is volatile even though the scalar subquery node is
+    /// not itself a volatile function.
+    ///
     /// See [`Volatility`] for more information.
     pub fn is_volatile(&self) -> bool {
-        self.exists(|expr| Ok(expr.is_volatile_node()))
-            .expect("exists closure is infallible")
+        self.exists(|expr| {
+            let subquery_is_volatile = match expr {
+                Expr::ScalarSubquery(subquery)
+                | Expr::Exists(Exists { subquery, .. })
+                | Expr::InSubquery(InSubquery { subquery, .. })
+                | Expr::SetComparison(SetComparison { subquery, .. }) => {
+                    subquery.is_volatile()
+                }
+                _ => false,
+            };
+            Ok(expr.is_volatile_node() || subquery_is_volatile)
+        })
+        .expect("exists closure is infallible")
     }

Review Comment:
   ```
   $ ag is_volatile | wc -l
         77
   ```
   
   We should be careful about changing the semantics of a widely used API like 
this. It's not clear to me that the change is _wrong_ (or right), but it has 
both behavioral and performance consequences. We should look at all of those 
existing call-sites and understand whether recursing into subqueries is the 
right behavior or not. And if we are going to change this, we should ensure it 
has unit test coverage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to