adriangb opened a new pull request, #21966:
URL: https://github.com/apache/datafusion/pull/21966

   ## Which issue does this PR close?
   
   Related to <https://github.com/apache/datafusion/issues/21835> (no `Closes` 
— this PR only does the `PhysicalExpr` half and only registers two built-ins; 
full closure is multi-PR).
   
   ## Rationale for this change
   
   DataFusion has no in-tree serialization on the `PhysicalExpr` trait. The 
proto crate handles serialization with a downcast chain inside 
`datafusion-proto/src/physical_plan/{from_proto,to_proto}.rs`, which forces 
every expression author to expose private state via `pub` accessors so the 
proto crate can poke at it (#21835).
   
   PR #21929's draft branch added a `PhysicalExpr::to_proto` hook to fix that — 
but the hook is protobuf-specific. If we ever want JSON, bincode, or anything 
else, we'd need a parallel `to_json`, `to_bincode`, etc., one trait method per 
format.
   
   This PR introduces a single `serde`-style hook on `PhysicalExpr` that works 
with any `serde` format (JSON for debugging, bincode for binary, a serde 
adapter on top of protobuf, etc.). The shape is intentionally minimal so it can 
be reviewed in isolation:
   
   - `PhysicalExpr::serde_tag(&self) -> &'static str` — default `""` means "not 
serializable".
   - `PhysicalExpr::erased_serialize(&self) -> Box<dyn erased_serde::Serialize 
+ '_>` — default returns an error sentinel.
   - `PhysicalExprDeserialize` trait + `PhysicalExprRegistry` builder + 
`DeserializeContext` over `&mut dyn erased_serde::Deserializer<'de>`.
   
   The deserialization side runs through any `serde::Deserializer` 
(`Registry::deserialize`) with `deserialize_json` as a convenience entry point. 
Trait-object children recurse through `registry.expr_seed()` inside hand-rolled 
visitors.
   
   Wire stability across DataFusion versions is **not** a goal here — the 
`datafusion-proto` crate stays the path for that. The serde hook is for 
in-process tooling, debugging, and intra-cluster ephemeral serialization. A 
future PR can write a serde adapter on top of prost so proto becomes another 
consumer of `erased_serialize`, at which point the proto crate's downcast chain 
can collapse.
   
   ## What changes are included in this PR?
   
   - **Trait additions on `PhysicalExpr`** (`physical-expr-common`): 
`serde_tag` and `erased_serialize`, both with default impls so this is 
non-breaking for existing in-tree and out-of-tree implementers.
   - **`impl Serialize for dyn PhysicalExpr`** producing a `{tag, data}` 
envelope. `Arc<dyn PhysicalExpr>` is automatically `Serialize` via serde's `rc` 
feature.
   - **`PhysicalExprDeserialize` trait + `PhysicalExprRegistry`** 
(`physical-expr-common`). Builder-style (`empty().with::<T>().with::<U>()`), 
modeled on the optimizer's rule list. No globals, no `inventory` crate (which 
has wasm/FFI gotchas DataFusion cares about).
     - `with::<T>()` panics on duplicate tag; `try_with::<T>()` returns 
`Result` for the runtime-plugin case.
   - **Two built-ins wired up**: `Column` (leaf) and `NotExpr` (single 
trait-object child, exercises the `DeserializeContext::deserialize_unary` 
helper).
   - **Runnable example** at `datafusion-examples/examples/physical_expr_serde` 
— defines a custom `PhysicalExpr` with a trait-object child, registers it on 
top of `default_registry()`, and round-trips through JSON.
   - **Library-user-guide page** walking through the trait surface, child 
handling, tag uniqueness, format choice, and how this differs from 
`datafusion-proto`.
   
   The remaining built-ins follow the same shape and will land in follow-up PRs 
in batches grouped by what they need from the rest of the codebase (`Operator` 
first, then `ScalarValue`/`arrow::DataType` for 
`Literal`/`Cast*`/`Like`/`InList`/`Case`).
   
   ## Are these changes tested?
   
   Yes:
   - 6 unit tests in `datafusion-physical-expr` exercise: Column round-trip, 
NotExpr round-trip, NOT(NOT(a)) nested round-trip, unknown-tag error path, 
`try_with` duplicate-tag error, `with` duplicate-tag panic.
   - The example is runnable end-to-end (`cargo run --example 
physical_expr_serde`) and prints the JSON, the round-tripped expression, and 
the expected error when the registry doesn't know about the custom tag.
   - `cargo check --workspace` and `cargo clippy --all-targets --all-features 
-- -D warnings` (on touched crates) clean.
   
   ## Are there any user-facing changes?
   
   Yes, but additive:
   - Two new optional methods on the `PhysicalExpr` trait, both with defaults. 
Existing implementers (in-tree and out-of-tree) compile unchanged; they just 
won't be serializable through the new path until they opt in.
   - New public types: `PhysicalExprRegistry`, `PhysicalExprDeserialize`, 
`DeserializeContext`, `ExprSeed`, plus `default_registry()` in 
`datafusion-physical-expr`.
   - New library-user-guide page documenting the surface.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to