adriangb opened a new issue, #21968:
URL: https://github.com/apache/datafusion/issues/21968

   There is some great work going on in arrow-rs (https://github.com/apache/arrow-rs/issues/9722) by @HippoBaro to make handling wide schemas (schemas with many columns) more performant.
   
   We have similarly wide data and spend a lot of time in our queries loading Parquet metadata. I spent some time investigating this today and concluded that wide schemas are not only expensive to load, they also add overhead to the rest of the query. Below is a summary of the analysis (AI generated). This makes me think we need to tackle the problem from the DataFusion side as well, with the first step being **adding a benchmark for wide schema queries**.
   
   ## TL;DR
   
   A simple selective query over 339 Parquet files runs **~11× slower** when the
   files carry hundreds of columns the query never touches, even though every
   unused column is correctly projected away and every unused row group is
   correctly pruned.
   
   `metadata_load_time` is the cleanest signal but only ~10% of the wall-clock
   overhead. The rest sits in downstream per-file setup that scales with schema
   width — most of it not explicitly bucketed by `EXPLAIN ANALYZE`.
   
   ## Setup
   
   - Two copies of the same 339 Parquet files, same rows, same row groups.
   - **wide**: full production schema (681 unique schema paths, 163,926 column
     chunks across the dataset).
   - **narrow**: only the 4 columns the query references (4 schema paths, 1,356
     column chunks).
   - Engine: `datafusion-cli 53.0.0`, local disk, hot OS cache, 12-way scan
     parallelism.
   - Query:
   
     ```sql
     SELECT start_timestamp, trace_id
     FROM '<path>/*.parquet'
     WHERE service_name = 'api' AND span_name = 'Report usage event';
     ```
   
     Returns 1,287 rows out of ~12.1 M scanned. Pruning behaviour is identical
     on both datasets: 350→350 files matched on stats, 339→339 row groups
     matched on stats, 339→320 row groups after bloom-filter pruning.
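
   For reference, a narrow copy like the one above can be produced with
   `datafusion-cli` itself; a minimal sketch, assuming hypothetical `wide/` and
   `narrow/` paths and writing a single output file rather than preserving the
   original 339-file layout:

   ```sql
   -- Hypothetical reproduction sketch: project only the four referenced columns
   -- out of the wide files and rewrite them as Parquet. The real setup kept the
   -- 339-file layout, so something like this would be run once per input file.
   COPY (
     SELECT start_timestamp, trace_id, service_name, span_name
     FROM 'wide/*.parquet'
   ) TO 'narrow/data.parquet' STORED AS PARQUET;
   ```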
   
   ## Wall time (3 hot runs each)
   
   | Dataset | run 1 | run 2 | run 3 | median |
   |---|---|---|---|---|
   | narrow | 0.053 s | 0.046 s | 0.049 s | **0.049 s** |
   | wide   | 0.548 s | 0.556 s | 0.557 s | **0.557 s** |
   
   ## EXPLAIN ANALYZE — DataSourceExec metrics
   
   Sums across 12 scan tasks. Same `output_bytes=1266.2 MB`, same final row
   count, identical pruning.
   
   | Phase | narrow | wide | Δ |
   |---|---|---|---|
   | **metadata_load_time** | 8.4 ms | 580 ms | **+572 ms** |
   | time_elapsed_opening | 27 ms | 161 ms | +134 ms |
   | time_elapsed_scanning_until_data | 308 ms | 526 ms | +218 ms |
   | time_elapsed_processing | 307 ms | 737 ms | +430 ms |
   | time_elapsed_scanning_total | 391 ms | 755 ms | +364 ms |
   | bloom_filter_eval_time | 349 ms | 284 ms | -65 ms |
   | statistics_eval_time | 4.6 ms | 7.0 ms | +2.4 ms |
   | page_index_eval_time | 13 µs | 14 µs | ~0 |
   | FilterExec elapsed_compute | 17.7 ms | 17.3 ms | ~0 |
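
   These sums come from running the query under `EXPLAIN ANALYZE` against each
   copy of the dataset and adding up the per-partition `DataSourceExec` metrics;
   a sketch, with a hypothetical path:

   ```sql
   -- Collect per-operator metrics (metadata_load_time, time_elapsed_opening, ...)
   -- for one dataset; repeat against the other copy and diff the DataSourceExec rows.
   EXPLAIN ANALYZE
   SELECT start_timestamp, trace_id
   FROM 'narrow/*.parquet'
   WHERE service_name = 'api' AND span_name = 'Report usage event';
   ```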
   
   ## Where the cost lands
   
   Wall-clock delta is **+508 ms** going narrow → wide. With ~12-way
   parallelism, dividing the cumulative metrics by 12 gives a rough wall-clock
   estimate per phase:
   
   | Phase | Cumulative Δ | ~Wall Δ |
   |---|---|---|
   | metadata_load_time | +572 ms | ~48 ms |
   | time_elapsed_opening | +134 ms | ~11 ms |
   | time_elapsed_processing (excl. metadata + opening) | ~+296 ms | ~25 ms |
   | Unattributed (serial setup, expression binding against wide schema, scan-task dispatch) | — | **~420 ms** |
   
   ## Observations
   
   1. **`metadata_load_time` scales roughly linearly with column-chunk count.**
      The per-chunk cost is a few µs in both datasets (8.4 ms / 1,356 chunks ≈
      6 µs; 580 ms / 163,926 chunks ≈ 3.5 µs) and is layout-independent; what
      changes is how many chunks the footer describes (see the back-of-envelope
      model after this list).
   
   2. **The wide schema keeps costing after metadata is loaded.**
      `time_elapsed_processing` more than doubles and
      `time_elapsed_scanning_until_data` grows by 218 ms even though the
      projected columns are identical. That is consistent with per-column
      reader-state construction, schema-descriptor traversal, and predicate
      binding against a 681-element schema running per file before the first
      useful byte flows.
   
   3. **Predicate evaluation phases are flat.** `bloom_filter_eval_time`,
      `statistics_eval_time`, `page_index_eval_time`,
      `FilterExec.elapsed_compute` — none of them respond to schema width. Once
      row groups are selected, the unused columns do not participate.
   
   4. **Most of the wall-clock gap is unattributed by `EXPLAIN ANALYZE`.** The
      bucketed phases sum to ~85 ms wall; the observed delta is 508 ms wall.
      The remaining ~420 ms is serial coordinator/planner work that scales with
      schema width and is not broken out in current metrics. On object storage
      this is likely amplified further because every file open also fetches a
      footer whose size grows with schema width, over a network round-trip.
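
   A back-of-envelope model implied by observation 1, treating footer decode as
   roughly constant per column chunk (an assumption, not a measured guarantee):

   ```math
   \text{metadata\_load\_time} \approx N_{\text{column chunks}} \times 3.5\ \mu\text{s}
   ```

   For the wide dataset this gives 163,926 × 3.5 µs ≈ 574 ms, in line with the
   measured 580 ms; the narrow dataset's slightly higher per-chunk figure
   suggests a small fixed per-file cost on top.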
   
   ## Conclusion
   
   The narrow result (~49 ms) is roughly what this query *should* cost. Anything
   above that is the price of carrying schema descriptions for columns the query
   never touched through the planner and scanner.
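
   As a starting point for the proposed benchmark, one way to approximate the
   wide case synthetically is to pad a narrow table with columns the query never
   references. A minimal sketch in `datafusion-cli`, with hypothetical paths and
   only a handful of pad columns (a real benchmark would generate hundreds
   programmatically, and probably with non-trivial values rather than NULLs):

   ```sql
   -- Hypothetical sketch: widen an existing narrow file with unused columns,
   -- then time the same selective query against both copies. NULL pads keep the
   -- extra column chunks tiny, so this mostly stresses footer size, not data size.
   COPY (
     SELECT
       start_timestamp,
       trace_id,
       service_name,
       span_name,
       CAST(NULL AS VARCHAR) AS pad_col_001,
       CAST(NULL AS VARCHAR) AS pad_col_002,
       CAST(NULL AS VARCHAR) AS pad_col_003
     FROM 'narrow/*.parquet'
   ) TO 'wide_synthetic/data.parquet' STORED AS PARQUET;
   ```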

