adriangb opened a new pull request, #22300:
URL: https://github.com/apache/datafusion/pull/22300
## Which issue does this PR close?
- Related to #21996
- Related to #21624
This is **not** a replacement for #21996 — it is a minimal subset of it,
carved out so the feature can be discussed/merged in smaller pieces.
## Rationale for this change
#21996 ("Query-aware statistics requests via ScanArgs / ScanResult") is a
full vertical slice: new statistics types, threading of requests from
optimizer → planner → provider, a built-in `RequestStatistics` optimizer rule,
and a consumer integration (`FilePruner` / `ListingTable`).
This PR extracts **only the framework hooks** — just enough that the rest
can be implemented *entirely outside* of DataFusion. A third party can write
their own optimizer rule to derive statistics requests, and their own
`TableProvider` to consume them, without DataFusion shipping any rule or
consumer of its own.
In stock DataFusion nothing observable changes: no rule populates the new
field, and the built-in providers ignore it.
## What changes are included in this PR?
Five small, independently reviewable commits:
1. **`refactor: add TableScanBuilder, deprecate TableScan::try_new`** —
`TableScan::try_new` takes five positional args and bare `TableScan { .. }`
literals are fragile to field additions. Introduce `TableScanBuilder` (with
`From<TableScan>`), move schema derivation into `build()`, deprecate `try_new`
(delegates to the builder), migrate all in-tree callers. Pure refactor.
2. **`feat: add StatisticsRequest / StatisticsValue / SatisfiedStatistics`**
— new public vocabulary types in `datafusion-expr-common::statistics`. Nothing
consumes them yet.
3. **`feat: add TableScan::statistics_requests field`** — an advisory
`Vec<StatisticsRequest>` on `TableScan`, settable via
`TableScan::with_statistics_requests` / `TableScanBuilder`. Empty by default;
DataFusion's own rules never populate it.
4. **`feat: thread statistics requests into ScanArgs`** — `ScanArgs` gains
`statistics_requests`; the physical planner threads
`TableScan::statistics_requests` into it so the request reaches
`TableProvider::scan_with_args`.
5. **`test: e2e statistics-request flow via a custom optimizer rule`** — an
integration test playing both external roles.
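
As a rough illustration of how commits 3 to 5 fit together, here is a minimal, self-contained sketch of the request flow. Every type in it is a simplified stand-in written for this example, not DataFusion's real definition: the `StatisticsRequest` variants, the `table_name` field, and the `annotate_scan`, `plan_scan`, and `provider_scan` functions are all assumptions. Only the names `TableScan`, `ScanArgs`, `statistics_requests`, and `with_statistics_requests` come from this PR.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's real types.

/// A statistic a scan might ask its provider for (variants are assumptions).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum StatisticsRequest {
    Min(String),
    Max(String),
    RowCount,
}

/// Logical scan node carrying an advisory request list (empty by default).
#[allow(dead_code)]
#[derive(Debug, Clone, Default)]
struct TableScan {
    table_name: String,
    statistics_requests: Vec<StatisticsRequest>,
}

/// Arguments handed to the provider at physical-planning time.
#[derive(Debug, Default)]
struct ScanArgs {
    statistics_requests: Vec<StatisticsRequest>,
}

impl ScanArgs {
    fn with_statistics_requests(mut self, requests: Vec<StatisticsRequest>) -> Self {
        self.statistics_requests = requests;
        self
    }
}

/// External role 1: an optimizer rule that annotates the scan.
/// Here it asks for min/max of a single (hypothetical) filter column.
fn annotate_scan(mut scan: TableScan, filter_column: &str) -> TableScan {
    scan.statistics_requests = vec![
        StatisticsRequest::Min(filter_column.to_string()),
        StatisticsRequest::Max(filter_column.to_string()),
    ];
    scan
}

/// Planner step: copy the requests from the logical node into the scan args.
fn plan_scan(scan: &TableScan) -> ScanArgs {
    ScanArgs::default().with_statistics_requests(scan.statistics_requests.clone())
}

/// External role 2: a provider that inspects what it was asked for.
/// Returns the number of requests it received.
fn provider_scan(args: &ScanArgs) -> usize {
    args.statistics_requests.len()
}
```

In this sketch, a scan that no rule has annotated reaches `provider_scan` with an empty list, which mirrors the "nothing observable changes" default described above.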
Deliberately **left out** compared with #21996: the built-in `RequestStatistics`
optimizer rule, the `FilePruner` / `ListingTable` consumer integration, the
`PartitionedFile::satisfied_stats` per-file response field, and
`StatisticsValue::Distribution` (which would depend on the now-deprecated
`Distribution` type).
## Are these changes tested?
Yes:
- `datafusion-expr-common`: a unit test checking that `StatisticsRequest` is
hashable and comparable for equality, so it can be used as a `HashMap` key.
- `datafusion/core/tests/user_defined/statistics_requests.rs`: an end-to-end
integration test where a custom `OptimizerRule` annotates `TableScan` and a
custom `TableProvider` asserts the requests reach `scan_with_args` — plus a
test that without such a rule the provider sees an empty request list.
- All existing `datafusion-expr` / `datafusion-optimizer` /
`datafusion-proto` tests pass against the `TableScanBuilder` refactor.
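
For readers wondering why hashability is worth a test, the sketch below shows the kind of lookup it enables: a provider's answers keyed by the request they satisfy. The enum is a hypothetical stand-in with made-up variants, and `lookup_demo` plus the `i64` value type are assumptions for illustration; the real `StatisticsRequest` and `StatisticsValue` may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `StatisticsRequest`; the variants are assumptions.
/// Deriving `Hash` and `Eq` is what makes it usable as a `HashMap` key.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum StatisticsRequest {
    Min(String),
    Max(String),
    RowCount,
}

/// Store answers keyed by request, then look one up again.
/// `i64` stands in for whatever value type a real provider would return.
fn lookup_demo() -> Option<i64> {
    let mut satisfied: HashMap<StatisticsRequest, i64> = HashMap::new();
    satisfied.insert(StatisticsRequest::Min("a".to_string()), 1);
    satisfied.insert(StatisticsRequest::RowCount, 100);
    // Inserting an equal key replaces the earlier value instead of adding a new entry.
    satisfied.insert(StatisticsRequest::Min("a".to_string()), 3);
    satisfied.get(&StatisticsRequest::Min("a".to_string())).copied()
}
```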
## Are there any user-facing changes?
Yes — this needs the `api change` label:
- New public types `StatisticsRequest`, `StatisticsValue`,
`SatisfiedStatistics` (re-exported via `datafusion_expr::statistics`).
- New `TableScanBuilder`; `TableScan::try_new` is **deprecated** (still
works, delegates to the builder).
- `TableScan` gains a new public field `statistics_requests` — this breaks
exhaustive `TableScan { .. }` struct literals downstream (the recommended fix
is `TableScanBuilder`).
- `ScanArgs` gains `with_statistics_requests` / `statistics_requests`.
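
To illustrate why a builder shields downstream code from field additions, here is a minimal sketch using hypothetical stand-in types. The field set, the `with_fetch` method, the use of `String` in place of `StatisticsRequest`, and the infallible `build()` are all assumptions for this example; only `TableScanBuilder`, `From<TableScan>`, `build()`, and `statistics_requests` are named in this PR, and the real signatures may differ.

```rust
/// Hypothetical, simplified scan node; NOT DataFusion's real `TableScan`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct TableScan {
    table_name: String,
    projection: Option<Vec<usize>>,
    fetch: Option<usize>,
    /// Newly added field. Any code writing out all fields in a struct literal
    /// stops compiling when a field like this appears.
    statistics_requests: Vec<String>, // `String` stands in for `StatisticsRequest`
}

/// Builder that owns the defaults, so callers only name the fields they care about.
struct TableScanBuilder {
    scan: TableScan,
}

impl TableScanBuilder {
    fn new(table_name: impl Into<String>) -> Self {
        Self {
            scan: TableScan {
                table_name: table_name.into(),
                projection: None,
                fetch: None,
                statistics_requests: Vec::new(),
            },
        }
    }

    fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.scan.fetch = fetch;
        self
    }

    fn with_statistics_requests(mut self, requests: Vec<String>) -> Self {
        self.scan.statistics_requests = requests;
        self
    }

    fn build(self) -> TableScan {
        self.scan
    }
}

/// Lets callers tweak an existing scan without restating every field.
impl From<TableScan> for TableScanBuilder {
    fn from(scan: TableScan) -> Self {
        Self { scan }
    }
}
```

With this shape, a caller that wrote `TableScanBuilder::new("t").with_fetch(Some(10)).build()` before the new field existed would keep compiling afterwards, while a full struct literal would need to be updated.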
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]