adriangb opened a new pull request, #21996:
URL: https://github.com/apache/datafusion/pull/21996

   ## Which issue does this PR close?
   
   - Relates to #21624 (`datafusion.execution.collect_statistics` on wide tables).
   
   ## Rationale for this change
   
   Today, statistics flow through DataFusion as an all-or-nothing dense
   `Statistics` struct: `collect_statistics=true` reads parquet thrift footers
   for *every column of every file*, allocates a `Vec<ColumnStatistics>` of
   length `num_columns` per file, and stores it whether the query references
   those columns or not. On wide tables that's a lot of memory and I/O that
   the query may not need.
   
   It also leaves third-party providers in an awkward spot. Delta, Iceberg,
   Hudi, and in-house catalogs already store per-column / per-file stats
   out-of-band — sometimes in dedicated columnar files, sometimes in
   manifests, sometimes in a metastore. Today they have no way to surface
   that to DataFusion's planner short of reconstructing a full dense
   `Statistics`, which defeats the point.
   
   This PR proposes a small handshake on the existing `scan_with_args` API
   that lets a caller ask a `TableProvider` for specific stats by name,
   and lets the provider answer only what it can deliver cheaply. The
   shape is intentionally minimal — enough to unblock memory and
   third-party wins on its own, with room to grow toward sketches,
   histograms, and selectivity stats later.
   
   This PR is being opened as a **draft** to gather feedback on the API shape
   before we commit to it. The companion follow-up work (a sampled-scan
   fallback layer, parquet-side memory measurement, optimizer rule
   integration) lives on a separate branch and is not included here.
   
   ## What changes are included in this PR?
   
   ### New types in `datafusion-common::stats`
   
   ```rust
   pub enum StatisticsRequest {
       Min(Column), Max(Column),
       NullCount(Column), DistinctCount(Column),
       Sum(Column), ByteSize(Column),
       RowCount, TotalByteSize,
   }

   pub enum StatisticsValue {
       Scalar(Precision<ScalarValue>),
       Distribution(Arc<dyn Any + Send + Sync>),  // reserved for future use
       Sketch(Arc<dyn Any + Send + Sync>),        // reserved for future use
       Absent,
   }
   ```
   
   The variants of `StatisticsRequest` mirror the fields of `Statistics` /
   `ColumnStatistics`, so a provider that already populates one can answer
   the other trivially. Whether a value is exact or estimated travels in
   the `Precision` wrapper, not in the request kind itself:
   `DistinctCount` covers both an exact distinct count from a metadata
   catalog and an HLL-style estimate from a sampled scan.
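
   To make the "exactness travels in `Precision`" point concrete, here is a
   minimal self-contained sketch. It mirrors the proposed enums with
   simplified stand-ins (`String` for `Column`, `u64` for `ScalarValue`);
   the function names and stored values are illustrative, not part of the PR:

   ```rust
   // Simplified stand-ins for the proposed types, for illustration only.
   #[derive(Debug, Clone, Hash, PartialEq, Eq)]
   enum StatisticsRequest {
       DistinctCount(String), // a column name stands in for `Column`
       RowCount,
   }

   #[derive(Debug, Clone, PartialEq)]
   enum Precision {
       Exact(u64),
       Inexact(u64),
   }

   // Two different sources answer the *same* request kind; only the
   // `Precision` wrapper differs.
   fn answer_from_catalog(_req: &StatisticsRequest) -> Precision {
       Precision::Exact(42) // e.g. an exact NDV kept in a metastore
   }

   fn answer_from_sample(_req: &StatisticsRequest) -> Precision {
       Precision::Inexact(40) // e.g. an HLL-style estimate from a sampled scan
   }

   fn main() {
       let req = StatisticsRequest::DistinctCount("user_id".to_string());
       assert_eq!(answer_from_catalog(&req), Precision::Exact(42));
       assert_eq!(answer_from_sample(&req), Precision::Inexact(40));
   }
   ```

   Because `StatisticsRequest` derives `Hash + Eq`, it can also serve directly
   as a map key, which the per-file sparse map below relies on.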
   
   ### `ScanArgs` / `ScanResult` extension
   
   ```rust
   ScanArgs::with_statistics_requests(requests: Option<&[StatisticsRequest]>)
   ScanArgs::statistics_requests() -> Option<&[StatisticsRequest]>

   ScanResult::with_statistics(statistics: Vec<StatisticsValue>)
   ScanResult::statistics() -> &[StatisticsValue]
   ScanResult::into_parts() -> (Arc<dyn ExecutionPlan>, Vec<StatisticsValue>)
   ```
   
   The contract: "answer what's free, leave the rest as `Absent`." The
   provider MUST NOT run expensive scans purely to satisfy these requests —
   the caller (e.g. a future sampled-stats helper, an optimizer rule, a
   diagnostic tool) decides what to do with the gaps.
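
   A provider-side sketch of that contract (not the actual DataFusion API;
   the `answer` function and the cached-row-count source are hypothetical,
   and the enums are simplified stand-ins): answer each request from metadata
   already in hand, and leave everything else `Absent` rather than scanning:

   ```rust
   // Simplified stand-ins for the proposed types.
   #[derive(Debug, Clone, PartialEq)]
   enum StatisticsRequest {
       RowCount,
       DistinctCount(String),
   }

   #[derive(Debug, Clone, PartialEq)]
   enum StatisticsValue {
       Scalar(u64), // stands in for `Precision<ScalarValue>`
       Absent,
   }

   // A provider that only has a cheap cached row count answers that request
   // and nothing else — no scan is ever triggered here.
   fn answer(
       requests: &[StatisticsRequest],
       cached_row_count: Option<u64>,
   ) -> Vec<StatisticsValue> {
       requests
           .iter()
           .map(|req| match (req, cached_row_count) {
               (StatisticsRequest::RowCount, Some(n)) => StatisticsValue::Scalar(n),
               _ => StatisticsValue::Absent, // not free to compute: leave the gap
           })
           .collect()
   }

   fn main() {
       let reqs = [
           StatisticsRequest::RowCount,
           StatisticsRequest::DistinctCount("x".to_string()),
       ];
       let answers = answer(&reqs, Some(1_000));
       assert_eq!(
           answers,
           vec![StatisticsValue::Scalar(1_000), StatisticsValue::Absent]
       );
   }
   ```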
   
   ### Per-file sparse stats
   
   ```rust
   PartitionedFile.satisfied_stats:
       Option<Arc<HashMap<StatisticsRequest, StatisticsValue>>>
   PartitionedFile::with_satisfied_stats(...)
   ```
   
   For per-file granularity. Memory scales with what was *asked for*
   (typically a handful of stats × the columns the query touches),
   not with table width. Providers that maintain per-file stats out-of-band
   can populate this directly without reconstructing a full dense
   `Statistics`.
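
   A sketch of populating the sparse map (the helper name and values are
   hypothetical; a real provider would read them from its out-of-band
   manifest). The point is the scaling: a 1,000-column table with two
   requested stats stores two entries per file, not 1,000 `ColumnStatistics`:

   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   // Simplified stand-ins for the proposed types.
   #[derive(Debug, Clone, Hash, PartialEq, Eq)]
   enum StatisticsRequest {
       Min(String),
       Max(String),
   }

   #[derive(Debug, Clone, PartialEq)]
   enum StatisticsValue {
       Scalar(i64),
   }

   // Build the sparse per-file map: one entry per *requested* stat.
   fn satisfied_stats_for_file(
       requests: &[StatisticsRequest],
   ) -> Arc<HashMap<StatisticsRequest, StatisticsValue>> {
       let mut map = HashMap::new();
       for req in requests {
           // Illustrative values; a real provider reads its manifest here.
           let value = match req {
               StatisticsRequest::Min(_) => StatisticsValue::Scalar(10),
               StatisticsRequest::Max(_) => StatisticsValue::Scalar(20),
           };
           map.insert(req.clone(), value);
       }
       Arc::new(map)
   }

   fn main() {
       let reqs = [
           StatisticsRequest::Min("x".to_string()),
           StatisticsRequest::Max("x".to_string()),
       ];
       let stats = satisfied_stats_for_file(&reqs);
       // Entry count tracks what was asked for, not the table's column count.
       assert_eq!(stats.len(), 2);
   }
   ```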
   
   ### `FilePruner` consumes the sparse map
   
   `FilePruner::file_stats_pruning` becomes
   `Box<dyn PruningStatistics + Send + Sync>` so we can dispatch between:

   * `PrunableStatistics` — a view of a dense `PartitionedFile.statistics`
     (existing path, unchanged behavior).
   * `SparseFilePruningStats` — a new adapter that, on each accessor,
     builds the corresponding `StatisticsRequest`, looks it up in the
     sparse map, and materializes the single-row array the pruning
     predicate needs. There is no densify-then-throw-away step: the 1-row
     arrays are only ever materialized for columns the pruning predicate
     actually touches.

   `FilePruner::try_new` prefers `statistics` when present, falls back to
   `satisfied_stats`, and returns `None` when neither is set.
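
   The source-preference logic in `try_new` can be sketched as a plain
   `Option` dispatch (types here are illustrative stand-ins, not the real
   `Statistics` / `satisfied_stats` representations):

   ```rust
   use std::collections::HashMap;

   // Stand-ins: `Vec<i64>` for dense `PartitionedFile.statistics`,
   // a `HashMap` for the sparse `satisfied_stats` map.
   enum PruningSource {
       Dense(Vec<i64>),
       Sparse(HashMap<String, i64>),
   }

   // Prefer dense stats, fall back to the sparse map, otherwise no pruner.
   fn choose_source(
       dense: Option<Vec<i64>>,
       sparse: Option<HashMap<String, i64>>,
   ) -> Option<PruningSource> {
       match (dense, sparse) {
           (Some(d), _) => Some(PruningSource::Dense(d)), // existing path wins
           (None, Some(s)) => Some(PruningSource::Sparse(s)),
           (None, None) => None, // nothing to prune with
       }
   }

   fn main() {
       assert!(matches!(
           choose_source(Some(vec![1]), None),
           Some(PruningSource::Dense(_))
       ));
       assert!(matches!(
           choose_source(None, Some(HashMap::new())),
           Some(PruningSource::Sparse(_))
       ));
       assert!(choose_source(None, None).is_none());
   }
   ```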
   
   ### `ListingTable` answers from footer metadata
   
   When a caller passes `with_statistics_requests(...)`, `scan_with_args`
   populates `ScanResult.statistics` from the merged dense `Statistics` it
   already touched. `Min`/`Max`/`NullCount`/`RowCount`/`TotalByteSize` come
   back as the `Precision` value the format produced (`Exact` for
   parquet thrift footers). `DistinctCount`/`Sum`/`ByteSize` come back
   as `Absent` for parquet — those aren't in thrift footers; layered
   helpers can fill them.

   When the session has `collect_statistics=false`, the provider returns
   `Absent` for everything (the contract is "answer what's free").
   
   ## What's *not* in this PR
   
   The companion work on the
   [`worktree-stats-mini-query-poc` branch](https://github.com/pydantic/datafusion/tree/worktree-stats-mini-query-poc)
   shows what a consumer side of this API can look like — a sampled-scan
   helper that fills `Absent` slots with HLL NDVs, a `TimeBoundedExec`
   ceiling, parquet `with_*_sampling` primitives, and end-to-end benchmark
   integration. None of that lands here. This PR is just the API shape.
   
   ## Are these changes tested?
   
   * `datafusion-common::stats::tests::statistics_request_is_hashable_keyable`
     round-trips a request through a `HashMap` to confirm `Hash + Eq`.
   * `datafusion-pruning::file_pruner::tests` (3 tests) demonstrate
     end-to-end pruning against a sparse-only `PartitionedFile` (`x > 100`
     prunes a `[10, 20]` file, `x > 15` doesn't, and no stats at all returns
     `None`).

   `cargo build --workspace` and `cargo test` on the changed crates
   (`datafusion-common`, `datafusion-catalog`, `datafusion-catalog-listing`,
   `datafusion-datasource`, `datafusion-pruning`) are green.
   
   ## Are there any user-facing changes?
   
   API additions only, all opt-in:
   
   * `ScanArgs` / `ScanResult` gain new fields with `Default`-friendly
     initializers; existing callers that don't use the new builders see
     no change.
   * `FilePruner`'s field-type change affects a private internal field.

   The only source-level break: `PartitionedFile` gains a new `pub
   satisfied_stats: Option<...>` field. Callers using
   `PartitionedFile::new` / `From<ObjectMeta>` / the existing builders are
   unaffected. Direct struct literals (uncommon, none in-tree) need to add
   `satisfied_stats: None` or migrate to the new `with_satisfied_stats`
   builder.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

