alamb commented on code in PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#discussion_r3146705559
##########
docs/source/library-user-guide/upgrading/54.0.0.md:
##########
@@ -320,6 +320,92 @@ The difference is only observable for strings containing
combining character clusters (e.g., ZWJ emoji sequences). For ASCII and most
common Unicode text, behavior is unchanged.
+### Scalar subquery execution changes
+
+Uncorrelated scalar subqueries (e.g. `SELECT ... WHERE x > (SELECT max(v) FROM t)`)
+are now executed by a dedicated physical operator rather than being rewritten to
+a join. Correlated scalar subqueries are unchanged.
+
+This produces two user-visible changes:
+
+- **Subqueries that return multiple rows now fail at runtime.** An uncorrelated
+  scalar subquery that returns more than one row fails with `Execution error:
+  Scalar subquery returned more than one row`. This matches the SQL standard and
+  the behavior of most other SQL implementations. The previous join-based
+  rewrite could silently produce multi-row output. Add a `LIMIT 1` or an
+  aggregate to the subquery to fix such queries (see the sketch after this
+  list).
+- **Plan shape changes.** Uncorrelated `Expr::ScalarSubquery` nodes now survive
+  into the final logical plan instead of being replaced by a join; the
+  corresponding physical plan contains a new `ScalarSubqueryExec` node and a
+  `ScalarSubqueryExpr` expression. Code that walks or transforms `LogicalPlan` /
+  `ExecutionPlan` trees, as well as `EXPLAIN` output, may need updating.
+
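+For example, a minimal sketch (the table and queries are illustrative; only
+the error behavior is taken from this note):
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    let ctx = SessionContext::new();
+    // Two-row table; the single column is named `column1` by default
+    ctx.sql("CREATE TABLE t AS VALUES (1), (2)").await?;
+
+    // The subquery yields two rows, so execution now fails with
+    // `Execution error: Scalar subquery returned more than one row`
+    let df = ctx
+        .sql("SELECT * FROM t WHERE column1 > (SELECT column1 FROM t)")
+        .await?;
+    assert!(df.collect().await.is_err());
+
+    // Fixed: the aggregate guarantees at most one row
+    ctx.sql("SELECT * FROM t WHERE column1 > (SELECT min(column1) FROM t)")
+        .await?
+        .collect()
+        .await?;
+    Ok(())
+}
+```
+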
+### `datafusion-proto`: expression deserialization now takes a `TaskContext`
+
+`Serializeable::from_bytes_with_registry` is renamed to `from_bytes_with_ctx`
+and takes a `&TaskContext` instead of a `&dyn FunctionRegistry`. `parse_expr`,
+`parse_exprs`, and `parse_sorts` change in the same way. `Expr::from_bytes`
+(without a registry argument) is unchanged.
+
+```diff
+-let expr = Expr::from_bytes_with_registry(&bytes, &registry)?;
++let expr = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;
+```
+
+```diff
+-let expr = parse_expr(&proto, &registry, &codec)?;
++let expr = parse_expr(&proto, ctx.task_ctx().as_ref(), &codec)?;
+```
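+
+A round-trip with the renamed API might look like this sketch, which assumes
+a `SessionContext` named `ctx` and uses the existing `to_bytes` half of
+`Serializeable`:
+
+```rust
+use datafusion::prelude::*;
+use datafusion_proto::bytes::Serializeable;
+
+fn roundtrip(ctx: &SessionContext) -> datafusion::error::Result<()> {
+    let expr = col("a").lt(lit(5));
+    let bytes = expr.to_bytes()?;
+    // `SessionContext::task_ctx()` returns an `Arc<TaskContext>`
+    let decoded = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;
+    assert_eq!(expr, decoded);
+    Ok(())
+}
+```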
+
+### `datafusion-proto`: `PhysicalProtoConverterExtension` reshaped
+
+`PhysicalProtoConverterExtension` and the `parse_physical_*_with_converter`
+helpers now take a single `&PhysicalPlanDecodeContext<'_>` that bundles the
+`TaskContext` and the `PhysicalExtensionCodec`. Implementations update like
+this:
+
+```diff
+ impl PhysicalProtoConverterExtension for MyConverter {
+     fn proto_to_execution_plan(
+         &self,
+-        ctx: &TaskContext,
+-        codec: &dyn PhysicalExtensionCodec,
+         proto: &protobuf::PhysicalPlanNode,
++        ctx: &PhysicalPlanDecodeContext<'_>,
+     ) -> Result<Arc<dyn ExecutionPlan>> {
+-        proto.try_into_physical_plan_with_converter(ctx, codec, self)
++        self.default_proto_to_execution_plan(proto, ctx)
+     }
+
+     fn proto_to_physical_expr(
+         &self,
+         proto: &PhysicalExprNode,
+-        ctx: &TaskContext,
+         input_schema: &Schema,
+-        codec: &dyn PhysicalExtensionCodec,
++        ctx: &PhysicalPlanDecodeContext<'_>,
+     ) -> Result<Arc<dyn PhysicalExpr>> {
+-        parse_physical_expr_with_converter(proto, ctx, input_schema, codec, self)
++        self.default_proto_to_physical_expr(proto, input_schema, ctx)
+     }
+ }
+```
+
+Inside these methods, retrieve the `TaskContext` or the codec with
+`ctx.task_ctx()` and `ctx.codec()`. At an API boundary, construct a fresh
+context with `PhysicalPlanDecodeContext::new(task_ctx, codec)`, as in the
+sketch below.
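+
+A sketch of such a boundary (this assumes `PhysicalProtoConverterExtension`
+is object-safe and importable from `datafusion_proto::physical_plan` alongside
+the other names; adjust paths to your setup):
+
+```rust
+use std::sync::Arc;
+use datafusion::error::Result;
+use datafusion::execution::TaskContext;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion_proto::physical_plan::{
+    DefaultPhysicalExtensionCodec, PhysicalPlanDecodeContext,
+    PhysicalProtoConverterExtension,
+};
+use datafusion_proto::protobuf;
+
+fn decode_plan(
+    task_ctx: &TaskContext,
+    proto: &protobuf::PhysicalPlanNode,
+    converter: &dyn PhysicalProtoConverterExtension,
+) -> Result<Arc<dyn ExecutionPlan>> {
+    // Bundle the per-call inputs once and thread them through decoding
+    let codec = DefaultPhysicalExtensionCodec {};
+    let ctx = PhysicalPlanDecodeContext::new(task_ctx, &codec);
+    converter.proto_to_execution_plan(proto, &ctx)
+}
+```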
+
+### `ExecutionProps` has new fields
+
+`ExecutionProps` gained new public fields. Code that constructs it via a
+struct literal, or pattern-matches it without `..`, no longer compiles. Use
+`ExecutionProps::new()` and include `..` in exhaustive patterns.
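+
+For instance, a sketch (the import path and the `query_execution_start_time`
+field match current releases and may differ in yours):
+
+```rust
+use datafusion_expr::execution_props::ExecutionProps;
+
+fn main() {
+    // Construct via the constructor rather than a struct literal
+    let props = ExecutionProps::new();
+
+    // Include `..` so newly added fields don't break the pattern
+    let ExecutionProps {
+        query_execution_start_time,
+        ..
+    } = &props;
+    println!("query started at {query_execution_start_time}");
+}
+```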
+
+### Wire format: scalar subquery serialization
+
+`datafusion-proto` adds new oneof variants to serialize scalar subqueries.
+Plans produced by DataFusion 54 that contain scalar subqueries cannot be
+decoded by older versions; upgrade producers and consumers together.
Review Comment:
I think this is always true for proto, see
https://docs.rs/datafusion-proto/latest/datafusion_proto/#version-compatibility
Thus I don't think it is necessary to restate it in the upgrade guide
##########
datafusion/proto/src/physical_plan/mod.rs:
##########
@@ -117,6 +119,58 @@ use crate::{convert_required, into_required};
pub mod from_proto;
pub mod to_proto;
+/// Context threaded through physical-plan deserialization.
+///
+/// This bundles the stable per-call inputs for deserialization and the
+/// per-scope `ScalarSubqueryResults` handle needed while reconstructing
+/// `ScalarSubqueryExpr` nodes inside a `ScalarSubqueryExec` input plan.
+#[derive(Clone)]
+pub struct PhysicalPlanDecodeContext<'a> {
+    /// Session configuration and runtime used while decoding
+    task_ctx: &'a TaskContext,
+    /// Codec for user-defined extension plans and expressions
+    codec: &'a dyn PhysicalExtensionCodec,
+    /// Per-scope handle, populated only while reconstructing
+    /// `ScalarSubqueryExpr` nodes inside a `ScalarSubqueryExec` input plan
+    scalar_subquery_results: Option<ScalarSubqueryResults>,
Review Comment:
I see this and I wonder if, longer term, we should try to attach the scalar
subquery results to the TaskContext.
However, we probably still need to pass it on ExecutionProps, and it seems
like then we would have to make sure the copy on ExecutionProps matches the
one on TaskContext, which isn't ideal either.
##########
.github/workflows/large_files.yml:
##########
@@ -34,9 +34,9 @@ jobs:
fetch-depth: 0
- name: Check size of new Git objects
env:
- # 1 MB ought to be enough for anybody.
Review Comment:
I think a tighter limit would be better, as this check exists to make sure
people don't accidentally add large objects in the repo