adriangb opened a new pull request, #21993: URL: https://github.com/apache/datafusion/pull/21993
## Which issue does this PR close? - Closes #. (no tracking issue yet; happy to file one if this direction is wanted) ## Rationale for this change `PartitionedFile.extensions` is currently a single `Option<Arc<dyn Any + Send + Sync>>` slot. That means two independent components cannot both attach their own per-file data without colliding — only one of `ParquetAccessPlan`, a custom index entry, a custom reader payload, etc. can be present at a time. Anyone wanting more than one piece of metadata has to define a wrapper struct and have every consumer agree on it. While exploring this, I noticed DataFusion already has the same `HashMap<TypeId, Arc<dyn Any + Send + Sync>>` pattern in two other places: - `datafusion/execution/src/config.rs` — backs `SessionConfig.extensions`, with a clever `IdHasher` that avoids rehashing `TypeId`s. - `datafusion/physical-plan/src/operator_statistics/mod.rs` — backs `ExtendedStatistics.extensions`. Rather than add a third copy, this PR lifts the pattern into `datafusion-common` as `datafusion_common::extensions::Extensions` and uses it for `PartitionedFile`. `SessionConfig` and `ExtendedStatistics` can adopt it in follow-up PRs to remove the duplication; this PR keeps that scope out. ## What changes are included in this PR? - New `datafusion_common::extensions::Extensions` (TypeId-keyed `HashMap` with the `IdHasher` lifted from `SessionConfig`). API: `insert` / `insert_arc` / `get` / `get_arc` / `contains` / `remove` / `merge`, all generic over `T`. - `PartitionedFile.extensions` changes from `Option<Arc<dyn Any + Send + Sync>>` to `FileExtensions` (re-export of `Extensions`). - New typed builder methods: `with_extension<T>(T)` and `with_extension_arc<T>(Arc<T>)` replace `with_extensions(Arc<dyn Any + Send + Sync>)`. New accessor `extension::<T>()` avoids manual downcasts. - Parquet opener migrated: `create_initial_plan` now takes `&FileExtensions` and uses `get::<ParquetAccessPlan>()` directly instead of doing the downcast inline. - In-tree call sites migrated: `parquet/custom_reader.rs` test, `parquet/external_access_plan.rs` test, and the `parquet_advanced_index.rs` example. - Upgrade guide entry added in `docs/source/library-user-guide/upgrading/54.0.0.md`. ## Are these changes tested? - New unit tests for `Extensions` (insert/get/remove/replace, merge precedence) in `datafusion-common`. - Existing parquet tests that exercised the single-slot API (`parquet::custom_reader::route_data_access_ops_to_parquet_file_reader_factory`, all 10 `parquet::external_access_plan::*`) pass after migration. - Full workspace `cargo check --workspace --all-targets` clean. - `./dev/rust_lint.sh` clean. ## Are there any user-facing changes? Yes — breaking public-API change on `PartitionedFile`: ```diff -let pf = PartitionedFile::new(path, size) - .with_extensions(Arc::new(access_plan)); +let pf = PartitionedFile::new(path, size) + .with_extension(access_plan); ``` ```diff -let plan = pf.extensions - .as_ref() - .and_then(|ext| ext.downcast_ref::<ParquetAccessPlan>()); +let plan = pf.extension::<ParquetAccessPlan>(); ``` The `extensions` field's public type also changes (`Option<Arc<dyn Any>>` → `FileExtensions`). Upgrade guide entry added. ## Open questions - Should `SessionConfig.extensions` and `ExtendedStatistics.extensions` migrate to the shared `Extensions` type in this PR, or as follow-ups? I left them alone here to keep the change focused, but happy to fold them in if reviewers prefer. - Naming: `Extensions` vs `AnyMap` vs `TypeMap`. I went with `Extensions` to match the existing field names; open to bikeshedding. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
