adriangb opened a new pull request, #21970:
URL: https://github.com/apache/datafusion/pull/21970

   ## Which issue does this PR close?
   
   N/A — opening as draft to discuss the benchmark shape before tying to an 
issue.
   
   ## Rationale for this change
   
   Adds a benchmark that reproduces a real-world wide-schema slowdown reported in production telemetry data: a simple selective query over many parquet files runs ~11× slower when the files carry hundreds of unused columns the query never touches. The cost lives almost entirely in per-file metadata setup (footer parsing, column-chunk metadata) and scales linearly with the number of column chunks in the dataset, while the predicate-evaluation phases (bloom filter, statistics, page index) stay flat regardless of schema width.
   
   The existing benchmarks don't exercise this shape — most TPC-H/ClickBench 
queries either touch many columns or filter heavily enough that scan work 
dominates. We need a focused benchmark so this regression is measurable in CI 
and so optimizations to the wide-schema scan path can be validated.
   
   ## What changes are included in this PR?
   
   Two new sql_benchmarks suites under `benchmarks/sql_benchmarks/`:
   
   - **`wide_tpch/`** — small-projection queries on a synthesized wide dataset 
(1024 cols × 256 files × 50k rows = ~225 MB).
   - **`narrow_tpch/`** — same SQL against a 7-col narrow dataset with 
identical row count, file count, and per-file row-group structure (~110 MB). 
The only variable between the two is schema width.
   
   A new binary, `gen_wide_data` (in `benchmarks/src/bin/`), synthesizes both datasets in ~60 s with no TPC-H source dependency. The schema is lineitem-shaped, so the columns the focused queries reference (`l_orderkey`, `l_shipdate`, `l_extendedprice`, `l_comment`, `l_shipmode`, `l_returnflag`, `l_linestatus`) carry deterministic synthetic data; the remaining 9 lineitem columns plus all suffix-renamed copies are zero-filled. Zero is used rather than null because the parquet reader can shortcut on all-null statistics; the measured slowdown is ~35% larger with zero padding than with null padding.
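   To make the "lineitem columns plus suffix-renamed copies" shape concrete, here is a minimal std-only sketch of how the 1024-column name list could be assembled. The naming scheme (`l_orderkey_1`, `l_orderkey_2`, …) is my illustration of the idea, not the exact scheme `gen_wide_data` uses:

   ```rust
   // Hypothetical sketch: the 16 canonical lineitem columns, followed by
   // suffix-renamed copies until the target width of 1024 is reached.
   // In the benchmark, only the 7 query-referenced columns carry real data;
   // everything else is zero-filled.
   const TARGET_WIDTH: usize = 1024;

   fn wide_column_names() -> Vec<String> {
       let base = [
           "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
           "l_quantity", "l_extendedprice", "l_discount", "l_tax",
           "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
           "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
       ];
       let mut names: Vec<String> = base.iter().map(|s| s.to_string()).collect();
       let mut suffix = 1;
       while names.len() < TARGET_WIDTH {
           for col in base {
               if names.len() == TARGET_WIDTH {
                   break;
               }
               names.push(format!("{col}_{suffix}"));
           }
           suffix += 1;
       }
       names
   }

   fn main() {
       let names = wide_column_names();
       assert_eq!(names.len(), TARGET_WIDTH);
       println!("last column: {}", names.last().unwrap());
   }
   ```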
   
   The headline query, `Qrepro`, matches the user-reported shape: a filter on two low-cardinality string columns (`l_shipmode = 'AIR' AND l_returnflag = 'R'`) plus a non-stat-prunable modulo predicate (`AND l_orderkey % 1000 = 0`) for tight selectivity (~0.005% match rate), a two-column projection, and no `LIMIT` or `ORDER BY`. The other queries in the suite cover predicates on duplicated columns far from the start of the schema (Q02, Q08, Q11), TopK (Q01), tight selectivity (Q03, Q10), and an `expect_plan` regression guard for projection pushdown (Q12).
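   Put concretely, the Qrepro shape is roughly the following (the table name and the two projected columns here are illustrative, not copied verbatim from the suite's SQL):

   ```sql
   -- Illustrative reconstruction of the Qrepro shape described above.
   SELECT l_orderkey, l_extendedprice
   FROM lineitem_wide
   WHERE l_shipmode = 'AIR'
     AND l_returnflag = 'R'
     AND l_orderkey % 1000 = 0;
   ```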
   
   `bench.sh` additions:
   
   ```shell
   ./benchmarks/bench.sh data wide_tpch    # synthesizes both wide+narrow datasets, ~60s, ~335 MB total
   ./benchmarks/bench.sh run wide_tpch     # focused queries on the wide dataset
   ./benchmarks/bench.sh run narrow_tpch   # same SQL on narrow (control)
   ```
   
   ## Are these changes tested?
   
   Yes. Measurements on an M-series Mac (12-way parallel scan, hot OS cache):
   
   **Criterion (3 s warmup, 10 samples):**
   
   | Query | narrow | wide | slowdown |
   |---|---|---|---|
   | **Qrepro** (string filter + tight selectivity) | 33.0 ms | 97.7 ms | **2.96×** |
   | Q03 (tight filter, LIMIT 1) | 2.3 ms | 9.7 ms | 4.20× |
   | Q10 (tight filter, no sort) | 3.1 ms | 11.4 ms | 3.63× |
   | Q01 (TopK shortcut) | 81.9 ms | 165.8 ms | 2.02× |
   
   **Cold-start `datafusion-cli` (Qrepro shape, median of 3):**
   
   | narrow | wide | slowdown |
   |---|---|---|
   | 60 ms | 1150 ms | **~19×** |
   
   **EXPLAIN ANALYZE phase deltas (Qrepro, cumulative across 12 scan tasks):**
   
   | Phase | narrow | wide | Δ |
   |---|---|---|---|
   | metadata_load_time | 6.1 ms | 732 ms | **120×** |
   | time_elapsed_opening | 2.7 ms | 54 ms | 20× |
   | time_elapsed_processing | 213 ms | 779 ms | 3.7× |
   | bloom_filter_eval_time | 1.5 ms | 1.8 ms | flat ✓ |
   | statistics_eval_time | 4.3 ms | 5.2 ms | flat ✓ |
   
   This qualitatively reproduces the user-reported regression: `metadata_load_time` and per-file setup scale with the column-chunk count, while the predicate-evaluation phases stay flat regardless of schema width.
   
   `cargo fmt --all` and `cargo clippy --bin gen_wide_data --all-features -- -D 
warnings` are clean.
   
   ## Are there any user-facing changes?
   
   No public API changes. New benchmark suite + new utility binary under 
`benchmarks/`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

