alamb commented on code in PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#discussion_r3146705559
##########
docs/source/library-user-guide/upgrading/54.0.0.md:
##########
@@ -320,6 +320,92 @@ The difference is only observable for strings containing
combining character clusters (e.g., ZWJ emoji sequences). For ASCII and most
common Unicode text, behavior is unchanged.
+### Scalar subquery execution changes
+
+Uncorrelated scalar subqueries (e.g. `SELECT ... WHERE x > (SELECT max(v) FROM t)`)
+are now executed by a dedicated physical operator rather than being rewritten to
+a join. Correlated scalar subqueries are unchanged.
+
+This produces two user-visible changes:
+
+- **Subqueries that return multiple rows now fail at runtime.** An uncorrelated
+  scalar subquery that returns more than one row fails with `Execution error:
+  Scalar subquery returned more than one row`. This matches the SQL standard and
+  the behavior of most other SQL implementations. The previous join-based
+  rewrite could silently produce multi-row output. Add a `LIMIT 1` or an
+  aggregate to the subquery to fix such queries (see the sketch after this
+  list).
+- **Plan shape changes.** Uncorrelated `Expr::ScalarSubquery` nodes now survive
+  into the final logical plan instead of being replaced by a join; the
+  corresponding physical plan contains a new `ScalarSubqueryExec` node and a
+  `ScalarSubqueryExpr` expression. Code that walks or transforms `LogicalPlan` /
+  `ExecutionPlan` trees, as well as `EXPLAIN` output, may need updating.
+
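+For example, a minimal sketch (the table and queries are illustrative; only
+the error behavior is taken from this note):
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    let ctx = SessionContext::new();
+    // Two-row table; the single column is named `column1` by default
+    ctx.sql("CREATE TABLE t AS VALUES (1), (2)").await?;
+
+    // The subquery yields two rows, so execution now fails with
+    // `Execution error: Scalar subquery returned more than one row`
+    let df = ctx
+        .sql("SELECT * FROM t WHERE column1 > (SELECT column1 FROM t)")
+        .await?;
+    assert!(df.collect().await.is_err());
+
+    // Fixed: the aggregate guarantees at most one row
+    ctx.sql("SELECT * FROM t WHERE column1 > (SELECT min(column1) FROM t)")
+        .await?
+        .collect()
+        .await?;
+    Ok(())
+}
+```
+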
+### `datafusion-proto`: expression deserialization now takes a `TaskContext`
+
+`Serializeable::from_bytes_with_registry` is renamed to `from_bytes_with_ctx`
+and takes a `&TaskContext` instead of a `&dyn FunctionRegistry`. `parse_expr`,
+`parse_exprs`, and `parse_sorts` change in the same way. `Expr::from_bytes`
+(without a registry argument) is unchanged.
+
+```diff
+-let expr = Expr::from_bytes_with_registry(&bytes, &registry)?;
++let expr = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;
+```
+
+```diff
+-let expr = parse_expr(&proto, &registry, &codec)?;
++let expr = parse_expr(&proto, ctx.task_ctx().as_ref(), &codec)?;
+```
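+
+A round-trip with the renamed API might look like this sketch, which assumes
+a `SessionContext` named `ctx` and uses the existing `to_bytes` half of
+`Serializeable`:
+
+```rust
+use datafusion::prelude::*;
+use datafusion_proto::bytes::Serializeable;
+
+fn roundtrip(ctx: &SessionContext) -> datafusion::error::Result<()> {
+    let expr = col("a").lt(lit(5));
+    let bytes = expr.to_bytes()?;
+    // `SessionContext::task_ctx()` returns an `Arc<TaskContext>`
+    let decoded = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;
+    assert_eq!(expr, decoded);
+    Ok(())
+}
+```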
+
+### `datafusion-proto`: `PhysicalProtoConverterExtension` reshaped
+
+`PhysicalProtoConverterExtension` and the `parse_physical_*_with_converter`
+helpers now take a single `&PhysicalPlanDecodeContext<'_>` that bundles the
+`TaskContext` and the `PhysicalExtensionCodec`. Implementations update like
+this:
+
+```diff
+ impl PhysicalProtoConverterExtension for MyConverter {
+     fn proto_to_execution_plan(
+         &self,
+-        ctx: &TaskContext,
+-        codec: &dyn PhysicalExtensionCodec,
+         proto: &protobuf::PhysicalPlanNode,
++        ctx: &PhysicalPlanDecodeContext<'_>,
+     ) -> Result<Arc<dyn ExecutionPlan>> {
+-        proto.try_into_physical_plan_with_converter(ctx, codec, self)
++        self.default_proto_to_execution_plan(proto, ctx)
+     }
+
+     fn proto_to_physical_expr(
+         &self,
+         proto: &PhysicalExprNode,
+-        ctx: &TaskContext,
+         input_schema: &Schema,
+-        codec: &dyn PhysicalExtensionCodec,
++        ctx: &PhysicalPlanDecodeContext<'_>,
+     ) -> Result<Arc<dyn PhysicalExpr>> {
+-        parse_physical_expr_with_converter(proto, ctx, input_schema, codec, self)
++        self.default_proto_to_physical_expr(proto, input_schema, ctx)
+     }
+ }
+```
+
+Inside these methods, retrieve the `TaskContext` or the codec with
+`ctx.task_ctx()` and `ctx.codec()`. At an API boundary, construct a fresh
+context with `PhysicalPlanDecodeContext::new(task_ctx, codec)`, as in the
+sketch below.
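+
+A sketch of such a boundary (this assumes `PhysicalProtoConverterExtension`
+is object-safe and importable from `datafusion_proto::physical_plan` alongside
+the other names; adjust paths to your setup):
+
+```rust
+use std::sync::Arc;
+use datafusion::error::Result;
+use datafusion::execution::TaskContext;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion_proto::physical_plan::{
+    DefaultPhysicalExtensionCodec, PhysicalPlanDecodeContext,
+    PhysicalProtoConverterExtension,
+};
+use datafusion_proto::protobuf;
+
+fn decode_plan(
+    task_ctx: &TaskContext,
+    proto: &protobuf::PhysicalPlanNode,
+    converter: &dyn PhysicalProtoConverterExtension,
+) -> Result<Arc<dyn ExecutionPlan>> {
+    // Bundle the per-call inputs once and thread them through decoding
+    let codec = DefaultPhysicalExtensionCodec {};
+    let ctx = PhysicalPlanDecodeContext::new(task_ctx, &codec);
+    converter.proto_to_execution_plan(proto, &ctx)
+}
+```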
+
+### `ExecutionProps` has new fields
+
+`ExecutionProps` gained new public fields. Code that constructs it via a
+struct literal, or pattern-matches it without `..`, no longer compiles. Use
+`ExecutionProps::new()` and include `..` in exhaustive patterns.
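+
+For instance, a sketch (the import path and the `query_execution_start_time`
+field match current releases and may differ in yours):
+
+```rust
+use datafusion_expr::execution_props::ExecutionProps;
+
+fn main() {
+    // Construct via the constructor rather than a struct literal
+    let props = ExecutionProps::new();
+
+    // Include `..` so newly added fields don't break the pattern
+    let ExecutionProps {
+        query_execution_start_time,
+        ..
+    } = &props;
+    println!("query started at {query_execution_start_time}");
+}
+```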
+
+### Wire format: scalar subquery serialization
+
+`datafusion-proto` adds new oneof variants to serialize scalar subqueries.
+Plans produced by DataFusion 54 that contain scalar subqueries cannot be
+decoded by older versions; upgrade producers and consumers together.
Review Comment:
I think this is always true for proto, see
https://docs.rs/datafusion-proto/latest/datafusion_proto/#version-compatibility
Thus I don't think it is necessary to restate it in the upgrade guide
##########
datafusion/proto/src/physical_plan/mod.rs:
##########
@@ -117,6 +119,58 @@ use crate::{convert_required, into_required};
pub mod from_proto;
pub mod to_proto;
+/// Context threaded through physical-plan deserialization.
+///
+/// This bundles the stable per-call inputs for deserialization and the
+/// per-scope `ScalarSubqueryResults` handle needed while reconstructing
+/// `ScalarSubqueryExpr` nodes inside a `ScalarSubqueryExec` input plan.
+#[derive(Clone)]
+pub struct PhysicalPlanDecodeContext<'a> {
+    /// Session configuration and runtime used while decoding
+    task_ctx: &'a TaskContext,
+    /// Codec for user-defined extension plans and expressions
+    codec: &'a dyn PhysicalExtensionCodec,
+    /// Per-scope handle, populated only while reconstructing
+    /// `ScalarSubqueryExpr` nodes inside a `ScalarSubqueryExec` input plan
+    scalar_subquery_results: Option<ScalarSubqueryResults>,
Review Comment:
I see this and I wonder if, longer term, we should try to attach the scalar
subquery results to the TaskContext.
However, we probably still need to pass it on ExecutionProps, and it seems
like then we would have to make sure the copy on ExecutionProps matches the
one on TaskContext, which isn't ideal either.
##########
.github/workflows/large_files.yml:
##########
@@ -34,9 +34,9 @@ jobs:
fetch-depth: 0
- name: Check size of new Git objects
env:
- # 1 MB ought to be enough for anybody.
Review Comment:
I think a tighter limit would be better, as this check exists to make sure
people don't accidentally add large objects in the repo