adriangb opened a new pull request, #21993:
URL: https://github.com/apache/datafusion/pull/21993

   ## Which issue does this PR close?
   
   - Closes #.
   
   (no tracking issue yet; happy to file one if this direction is wanted)
   
   ## Rationale for this change
   
   `PartitionedFile.extensions` is currently a single `Option<Arc<dyn Any + 
Send + Sync>>` slot. That means two independent components cannot both attach 
their own per-file data without colliding — only one of `ParquetAccessPlan`, a 
custom index entry, a custom reader payload, etc. can be present at a time. 
Anyone wanting more than one piece of metadata has to define a wrapper struct 
and have every consumer agree on it.
   
   While exploring this, I noticed DataFusion already has the same 
`HashMap<TypeId, Arc<dyn Any + Send + Sync>>` pattern in two other places:
   
   - `datafusion/execution/src/config.rs` — backs `SessionConfig.extensions`, 
with a clever `IdHasher` that avoids rehashing `TypeId`s.
   - `datafusion/physical-plan/src/operator_statistics/mod.rs` — backs 
`ExtendedStatistics.extensions`.
   
   Rather than add a third copy, this PR lifts the pattern into 
`datafusion-common` as `datafusion_common::extensions::Extensions` and uses it 
for `PartitionedFile`. `SessionConfig` and `ExtendedStatistics` can adopt it in 
follow-up PRs to remove the duplication; this PR keeps that scope out.
   
   ## What changes are included in this PR?
   
   - New `datafusion_common::extensions::Extensions` (TypeId-keyed `HashMap` 
with the `IdHasher` lifted from `SessionConfig`). API: `insert` / `insert_arc` 
/ `get` / `get_arc` / `contains` / `remove` / `merge`, all generic over `T`.
   - `PartitionedFile.extensions` changes from `Option<Arc<dyn Any + Send + 
Sync>>` to `FileExtensions` (re-export of `Extensions`).
   - New typed builder methods: `with_extension<T>(T)` and 
`with_extension_arc<T>(Arc<T>)` replace `with_extensions(Arc<dyn Any + Send + 
Sync>)`. New accessor `extension::<T>()` avoids manual downcasts.
   - Parquet opener migrated: `create_initial_plan` now takes `&FileExtensions` 
and uses `get::<ParquetAccessPlan>()` directly instead of doing the downcast 
inline.
   - In-tree call sites migrated: `parquet/custom_reader.rs` test, 
`parquet/external_access_plan.rs` test, and the `parquet_advanced_index.rs` 
example.
   - Upgrade guide entry added in 
`docs/source/library-user-guide/upgrading/54.0.0.md`.
   
   ## Are these changes tested?
   
   - New unit tests for `Extensions` (insert/get/remove/replace, merge 
precedence) in `datafusion-common`.
   - Existing parquet tests that exercised the single-slot API 
(`parquet::custom_reader::route_data_access_ops_to_parquet_file_reader_factory`,
 all 10 `parquet::external_access_plan::*`) pass after migration.
   - Full workspace `cargo check --workspace --all-targets` clean.
   - `./dev/rust_lint.sh` clean.
   
   ## Are there any user-facing changes?
   
   Yes — breaking public-API change on `PartitionedFile`:
   
   ```diff
   -let pf = PartitionedFile::new(path, size)
   -    .with_extensions(Arc::new(access_plan));
   +let pf = PartitionedFile::new(path, size)
   +    .with_extension(access_plan);
   ```
   
   ```diff
   -let plan = pf.extensions
   -    .as_ref()
   -    .and_then(|ext| ext.downcast_ref::<ParquetAccessPlan>());
   +let plan = pf.extension::<ParquetAccessPlan>();
   ```
   
   The `extensions` field's public type also changes (`Option<Arc<dyn Any>>` → 
`FileExtensions`). Upgrade guide entry added.
   
   ## Open questions
   
   - Should `SessionConfig.extensions` and `ExtendedStatistics.extensions` 
migrate to the shared `Extensions` type in this PR, or as follow-ups? I left 
them alone here to keep the change focused, but happy to fold them in if 
reviewers prefer.
   - Naming: `Extensions` vs `AnyMap` vs `TypeMap`. I went with `Extensions` to 
match the existing field names; open to bikeshedding.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to