adriangb opened a new pull request, #21996:
URL: https://github.com/apache/datafusion/pull/21996
## Which issue does this PR close?
- Relates to #21624 (`datafusion.execution.collect_statistics` on wide tables).
## Rationale for this change
Today, statistics flow through DataFusion as an all-or-nothing dense
`Statistics` struct: `collect_statistics=true` reads parquet thrift footers
for *every column of every file*, allocates a `Vec<ColumnStatistics>` of
length `num_columns` per file, and stores it whether the query references
those columns or not. On wide tables that's a lot of memory and IO the
query may never need.
It also leaves third-party providers in an awkward spot. Delta, Iceberg,
Hudi, and in-house catalogs already store per-column / per-file stats
out-of-band — sometimes in dedicated columnar files, sometimes in
manifests, sometimes in a metastore. Today they have no way to surface
that to DataFusion's planner short of reconstructing a full dense
`Statistics`, which defeats the point.
This PR proposes a small handshake on the existing `scan_with_args` API
that lets a caller ask a `TableProvider` for specific stats by name,
and lets the provider answer only what it can deliver cheaply. The
shape is intentionally minimal — enough to unblock memory and
third-party wins on its own, with room to grow toward sketches,
histograms, and selectivity stats later.
This PR is being opened as a **draft** to gather feedback on the API
shape before we commit to it. The companion follow-up work (a
sampled-scan fallback layer, parquet-side memory measurement, optimizer
rule integration) lives on a separate branch and is not included here.
## What changes are included in this PR?
### New types in `datafusion_common::stats`
```rust
pub enum StatisticsRequest {
    Min(Column),
    Max(Column),
    NullCount(Column),
    DistinctCount(Column),
    Sum(Column),
    ByteSize(Column),
    RowCount,
    TotalByteSize,
}

pub enum StatisticsValue {
    Scalar(Precision<ScalarValue>),
    Distribution(Arc<dyn Any + Send + Sync>), // reserved for future use
    Sketch(Arc<dyn Any + Send + Sync>),       // reserved for future use
    Absent,
}
```
The variants of `StatisticsRequest` mirror the fields of `Statistics` /
`ColumnStatistics`, so a provider that already populates one can answer
the other trivially. Whether a value is exact or estimated travels in
the `Precision` wrapper, not in the request kind itself —
`DistinctCount` covers both an exact distinct count from a metadata
catalog and an HLL-style estimate from a sampled scan.
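As a quick illustration of that point — with simplified stand-ins for `Precision` and for the `Column` payload (a `String` here); the real types live in `datafusion-common` — the same request kind can carry an exact answer, an estimate, or nothing:

```rust
// Simplified stand-in for the real Precision type (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// In the real API the payload is a `Column`; a String stands in here.
#[derive(Debug, Clone, Hash, PartialEq, Eq)]
enum StatisticsRequest {
    DistinctCount(String),
}

fn main() {
    // One request kind...
    let req = StatisticsRequest::DistinctCount("user_id".into());
    // ...several possible answers: exactness rides on the Precision wrapper.
    let from_catalog: Precision<u64> = Precision::Exact(42); // metadata catalog
    let from_sample: Precision<u64> = Precision::Inexact(40); // HLL-style estimate
    let unknown: Precision<u64> = Precision::Absent;

    assert!(matches!(req, StatisticsRequest::DistinctCount(_)));
    assert_ne!(from_catalog, from_sample);
    assert_ne!(from_sample, unknown);
}
```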
### `ScanArgs` / `ScanResult` extension
```rust
ScanArgs::with_statistics_requests(requests: Option<&[StatisticsRequest]>)
ScanArgs::statistics_requests() -> Option<&[StatisticsRequest]>
ScanResult::with_statistics(statistics: Vec<StatisticsValue>)
ScanResult::statistics() -> &[StatisticsValue]
ScanResult::into_parts() -> (Arc<dyn ExecutionPlan>, Vec<StatisticsValue>)
```
The contract: "answer what's free, leave the rest as `Absent`." The
provider MUST NOT do expensive scans purely to satisfy these requests —
the caller (e.g. a future sampled-stats helper, an optimizer rule, a
diagnostic tool) decides what to do with the gaps.
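A hypothetical provider honoring this contract might look like the sketch below. The types are stand-ins, and the positional alignment of answers to requests is an assumption inferred from the `Vec<StatisticsValue>` signatures above:

```rust
// Stand-ins for the real request/value types (illustration only).
#[derive(Debug, Clone, Hash, PartialEq, Eq)]
enum StatisticsRequest {
    RowCount,
    NullCount(String),
    DistinctCount(String),
}

#[derive(Debug, Clone, PartialEq)]
enum StatisticsValue {
    Scalar(u64), // the real API wraps Precision<ScalarValue>
    Absent,
}

/// One answer per request. Anything that would need extra IO stays Absent;
/// here only a row count already cached in memory counts as "free".
fn answer_requests(requests: &[StatisticsRequest], cached_row_count: u64) -> Vec<StatisticsValue> {
    requests
        .iter()
        .map(|req| match req {
            StatisticsRequest::RowCount => StatisticsValue::Scalar(cached_row_count),
            _ => StatisticsValue::Absent, // never scan just to satisfy a request
        })
        .collect()
}

fn main() {
    let reqs = [
        StatisticsRequest::RowCount,
        StatisticsRequest::DistinctCount("a".into()),
    ];
    let answers = answer_requests(&reqs, 1000);
    assert_eq!(answers, vec![StatisticsValue::Scalar(1000), StatisticsValue::Absent]);
}
```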
### Per-file sparse stats
```rust
PartitionedFile.satisfied_stats:
    Option<Arc<HashMap<StatisticsRequest, StatisticsValue>>>
PartitionedFile::with_satisfied_stats(...)
```
This gives per-file granularity. Memory scales with what was *asked for*
(typically a handful of stats × the columns the query touches),
not with table width. Providers that maintain per-file stats out-of-band
can populate this directly without reconstructing a full dense
`Statistics`.
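A minimal sketch of how a provider might fill such a map (stand-in types for illustration; the point is that only the requested stats are ever allocated):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins for the real StatisticsRequest / StatisticsValue (illustration only).
#[derive(Debug, Clone, Hash, PartialEq, Eq)]
enum StatisticsRequest {
    Min(String),
    Max(String),
    RowCount,
}

#[derive(Debug, Clone, PartialEq)]
enum StatisticsValue {
    Scalar(i64),
}

/// Builds the sparse per-file map: only the stats the query asked for,
/// no matter how many columns the table has.
fn sparse_stats_for_file() -> Arc<HashMap<StatisticsRequest, StatisticsValue>> {
    let mut satisfied = HashMap::new();
    satisfied.insert(StatisticsRequest::Min("x".into()), StatisticsValue::Scalar(10));
    satisfied.insert(StatisticsRequest::Max("x".into()), StatisticsValue::Scalar(20));
    satisfied.insert(StatisticsRequest::RowCount, StatisticsValue::Scalar(5000));
    // Arc so the map can be shared cheaply, as in PartitionedFile.satisfied_stats.
    Arc::new(satisfied)
}

fn main() {
    let stats = sparse_stats_for_file();
    // 3 entries for a query touching one column of a possibly 1000-column table.
    assert_eq!(stats.len(), 3);
    assert_eq!(
        stats.get(&StatisticsRequest::Min("x".into())),
        Some(&StatisticsValue::Scalar(10))
    );
}
```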
### `FilePruner` consumes the sparse map
`FilePruner::file_stats_pruning` becomes
`Box<dyn PruningStatistics + Send + Sync>` so we can dispatch between:
* `PrunableStatistics` — a view of a dense `PartitionedFile.statistics`
  (existing path, unchanged behavior).
* `SparseFilePruningStats` — a new adapter that, on each accessor call,
  builds the corresponding `StatisticsRequest`, looks it up in the
  sparse map, and materializes the single-row array the pruning
  predicate needs. No densify-then-throw-away — the 1-row arrays are
  only ever materialized for columns the pruning predicate actually
  touches.

`FilePruner::try_new` prefers `statistics` when present, falls back to
`satisfied_stats`, and returns `None` when neither is set.
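That selection order can be sketched as follows (a hypothetical simplification of `FilePruner::try_new`'s choice, with `Option<()>` standing in for the dense `Statistics` and the sparse map):

```rust
// Which pruning-statistics adapter a pruner would be built over.
#[derive(Debug, PartialEq)]
enum StatsSource {
    Dense,  // PrunableStatistics over PartitionedFile.statistics
    Sparse, // SparseFilePruningStats over satisfied_stats
}

fn choose_source(dense: Option<()>, sparse: Option<()>) -> Option<StatsSource> {
    match (dense, sparse) {
        (Some(_), _) => Some(StatsSource::Dense),     // prefer dense when present
        (None, Some(_)) => Some(StatsSource::Sparse), // fall back to the sparse map
        (None, None) => None,                         // no pruner can be built
    }
}

fn main() {
    assert_eq!(choose_source(Some(()), Some(())), Some(StatsSource::Dense));
    assert_eq!(choose_source(None, Some(())), Some(StatsSource::Sparse));
    assert_eq!(choose_source(None, None), None);
}
```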
### `ListingTable` answers from footer metadata
When a caller passes `with_statistics_requests(...)`, `scan_with_args`
populates `ScanResult.statistics` from the merged dense `Statistics` it
already touched.
`Min`/`Max`/`NullCount`/`RowCount`/`TotalByteSize` come
back as the `Precision` value the format produced (`Exact` for
parquet thrift footers). `DistinctCount`/`Sum`/`ByteSize` come back
as `Absent` for parquet — those aren't in thrift footers; layered
helpers can fill them.
When the session has `collect_statistics=false`, the provider returns
`Absent` for everything (the contract is "answer what's free").
## What's *not* in this PR
The companion work on the
[`worktree-stats-mini-query-poc`
branch](https://github.com/pydantic/datafusion/tree/worktree-stats-mini-query-poc)
shows what the consumer side of this API can look like — a sampled-scan
helper that fills `Absent` slots with HLL NDVs, a `TimeBoundedExec`
ceiling, parquet `with_*_sampling` primitives, and end-to-end benchmark
integration. None of that lands here; this PR is just the API shape.
## Are these changes tested?
* `datafusion_common::stats::tests::statistics_request_is_hashable_keyable`
  round-trips a request through a `HashMap` to confirm `Hash + Eq`.
* `datafusion_pruning::file_pruner::tests` (3 tests) demonstrate
  end-to-end pruning against a sparse-only `PartitionedFile` (`x > 100`
  prunes a `[10, 20]` file, `x > 15` doesn't, no-stats-at-all returns
  `None`).

`cargo build --workspace` and `cargo test` on the changed crates
(`datafusion-common`, `datafusion-catalog`, `datafusion-catalog-listing`,
`datafusion-datasource`, `datafusion-pruning`) are green.
## Are there any user-facing changes?
API additions only, all opt-in:
* `ScanArgs` / `ScanResult` gain new fields with `Default`-friendly
  initializers; existing callers that don't use the new builders see
  no change.
* `FilePruner`'s field-type change affects a private internal field.

The only source-level break: `PartitionedFile` gains a new `pub
satisfied_stats: Option<...>` field. Callers using
`PartitionedFile::new` / `From<ObjectMeta>` / the existing builders are
unaffected. Direct struct literals (uncommon, none in-tree) need to add
`satisfied_stats: None` or migrate to the new `with_satisfied_stats`
builder.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]