adriangb commented on issue #21835:
URL: https://github.com/apache/datafusion/issues/21835#issuecomment-4348350257
## Follow-up: Consolidate proto types and remove `FromProto`/`TryFromProto`
In #21929 the proto-models extraction ran into the orphan rule on a number
of `impl From<&protobuf::X> for Y` blocks where both sides became foreign to
`datafusion-proto`. We worked around this with a pair of local `FromProto<T>` /
`TryFromProto<T>` traits in a new `datafusion_proto::convert` module, mirroring
`From` / `TryFrom`. That gets the PR to compile and preserves wire
compatibility, but it's a workaround rather than a clean architecture: callers
see `Y::from_proto(p)` instead of `Y::from(p)`, and the impls live in
`datafusion-proto` rather than next to the target types they convert into.
This issue tracks the follow-up cleanup that gets us back to standard
`From`/`TryFrom` everywhere, with each impl living next to either its target
type or its proto type.
### What blocks doing it in #21929
The blocker is dep-graph cycles, not the orphan rule by itself. Each option
that keeps standard `From`/`TryFrom` syntax requires *somewhere* the impl can
live where one of the types is local. The natural homes (the target-type crates
with a `proto` feature) cycle today:
```
datafusion-proto-models ──depends on──▶ datafusion-proto-common
──depends on──▶ datafusion-common
▲
│
└────────────────────── if datafusion-common[proto] depends on it
─────────────────────┘
(cycle)
```
So `datafusion-common[proto] → datafusion-proto-models` is rejected by cargo
because `datafusion-proto-models → datafusion-proto-common → datafusion-common`
already exists in the other direction.
This is the same cycle re-emerging anywhere we'd want a target-type crate to
host From/TryFrom impls touching `datafusion-common` types.
### The clean architecture
Collapse `datafusion-proto-common` into `datafusion-proto-models` and push
impls down to target-type crates:
| Crate | Owns | Deps |
|---|---|---|
| `datafusion-proto-models` | All prost-generated types from both `.proto`
files. **Arrow** conversion impls (which must live here because `arrow::*` is
foreign and `proto-models` is the only crate where the proto type is local).
Optional `pbjson`/`serde`. | `prost`, `arrow`, no datafusion-* crates |
| `datafusion-common[proto]` | `From`/`TryFrom` impls between proto types
and `datafusion-common` types (~25 impls currently in `proto-common`'s
`from_proto.rs`/`to_proto.rs`). | adds optional dep on `proto-models` |
| `datafusion-expr[proto]` | Impls for `WindowFrame`, `WindowFrameBound`,
`WindowFrameUnits`, `WriteOp`, `NullTreatment`, `JoinConstraint` (~6 from
`datafusion-proto`'s `convert.rs`). | adds optional dep on `proto-models` |
| `datafusion-datasource[proto]` | Impls for `PartitionedFile`, `FileGroup`,
`FileRange`, `FileSinkConfig` (~4 from `datafusion-proto`). | adds optional dep
on `proto-models` |
| `datafusion-datasource-{csv,json,parquet}[proto]` | `JsonSink`, `CsvSink`,
`ParquetSink` impls (3, one each). | adds optional dep on `proto-models` |
| `datafusion-proto` | Encode/decode orchestration only. No `convert.rs`, no
`FromProto`/`TryFromProto` traits, no `From` impls. | unchanged |
| `datafusion-proto-common` | **Deleted**, or shrunk to a thin re-export
shim if back-compat matters. | n/a |
Cycle check after the move:
- `datafusion-common[proto] → proto-models → {prost, arrow}` — proto-models
has no datafusion deps, no cycle. ✓
- `datafusion-expr[proto] → proto-models` — same, no cycle. ✓
- `datafusion-datasource[proto] → proto-models` — same, no cycle. Even with
the eventual `physical-expr-common[proto] → proto-models` edge from #21929,
there's no path back from proto-models to a datafusion crate. ✓
- All downstream crates can host their own impls under standard
`From`/`TryFrom`.
End state for callers:
```rust
// Today (workaround in #21929):
let y = Y::try_from_proto(&p)?;
let p = protobuf::X::from_proto(y);
// After this cleanup:
let y: Y = (&p).try_into()?;
let p: protobuf::X = y.into();
```
### Why it's a separate PR
The work is mechanical but voluminous:
1. **Make `proto-models` standalone** — drop dep on `proto-common`, add an
in-crate `pub mod datafusion_common` to resolve `super::datafusion_common::Foo`
references in the prost-generated code, regenerate `pbjson` for both
`.datafusion` and `.datafusion_common` packages so the embedded common types
get serde impls under the `json` feature.
2. **Move arrow conversion impls into `proto-models`** — about ~6 impl
blocks (Schema, Field, ArrowType, DataType, TimeUnit, IntervalUnit) currently
in `proto-common/src/from_proto/mod.rs` and `to_proto/mod.rs`. These must live
in `proto-models` because both `arrow::Schema` and `proto::Schema` are foreign
to every other crate. Add `arrow` dep.
3. **Move `datafusion-common` impls** — the remaining ~25 impls in
`proto-common`'s from/to_proto modules go into a new `datafusion-common[proto]`
module. `datafusion-common` gains an optional `proto-models` dep under the
`proto` feature.
4. **Update `datafusion-ffi`** — currently uses
`datafusion_proto_common::from_proto::parse_proto_fields_to_fields` and
`datafusion_proto_common::Field`. Migrate to the new locations.
5. **Update `datafusion-proto`** — its `protobuf` module currently
re-exports the explicit list `Schema, ScalarValue, ArrowType, ...` from
`proto-common`'s `protobuf_common`, plus a wildcard from `proto-models`. After
consolidation it's just `pub use proto_models::generated::datafusion::*` (and
the equivalent for the embedded common types). All ~280 callsites that go
through `proto-common`'s types via `let s: protobuf::Schema =
arrow.try_into()?` continue to compile transparently because `protobuf::Schema`
now resolves to a single type.
6. **Move plan-level impls** — `datafusion-proto/src/convert.rs` deletes;
the ~13 impls there relocate to `datafusion-expr[proto]`,
`datafusion-datasource[proto]`, and the file-format crates. ~50 callsites in
`datafusion-proto` switch from `Y::from_proto(p)` back to `Y::from(p)`.
7. **Delete `proto-common`** (or stub it as a thin re-export shim and
`#[deprecated]` it).
Verified mid-attempt during #21929: the standalone-`proto-models` step works
(`cargo check -p datafusion-proto-models --features json` passes), but it
surfaces ~280 type mismatches in `datafusion-proto` because of two parallel
`Schema` types (`proto-common`'s and `proto-models`'s embedded copy) — those
mismatches go away once the consolidation in step 5 is done. So the steps must
land together; you can't do step 1 without doing through step 5 in the same PR.
### Sequencing in a single follow-up PR
Recommended commit shape (4-6 hours of careful work):
1. `Make datafusion-proto-models standalone` — split `datafusion_common`
into a sibling module, drop proto-common dep, regenerate pbjson with both
packages.
2. `Move arrow conversion impls into proto-models` — extract from
proto-common.
3. `Add proto feature to datafusion-common, move target-type impls there` —
extract the rest from proto-common.
4. `Migrate datafusion-ffi to the new paths` — small, ~5 imports.
5. `Use proto-models types throughout datafusion-proto, drop
FromProto/TryFromProto` — this is where the ~280 mismatches resolve and
`convert.rs` deletes.
6. `Add proto feature to datafusion-expr/datasource/{csv,json,parquet}, move
plan-level impls` — relocates the ~13 impls in datafusion-proto.
7. `Delete (or stub) datafusion-proto-common` — depending on whether we want
a back-compat shim.
Each commit should keep the workspace compiling. Tests stay green throughout
because wire format is preserved.
### Out-of-scope notes
- Wire compatibility is preserved — this is purely a Rust-level refactor,
the prost types and serialized bytes are unchanged.
- The `PhysicalPlanNodeExt` trait introduced in #21929 (for
inherent-on-foreign-type orphan-rule cases) is unrelated and stays. Its
public-API break (`use datafusion_proto::physical_plan::PhysicalPlanNodeExt;`)
is genuinely required.
- The `to_proto` hook on `PhysicalExpr` (the actual goal of this issue) is
delivered in #21929 and unaffected by this follow-up.
cc #21929
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]