adriangb opened a new issue, #21968: URL: https://github.com/apache/datafusion/issues/21968
There is some great work going on in arrow-rs (https://github.com/apache/arrow-rs/issues/9722) by @HippoBaro to make wide schemas (schemas with many columns) more performant. We have similarly wide data and spend a lot of time in our queries loading Parquet metadata. I spent some time investigating this today and concluded that wide schemas are not only expensive to load: they also introduce overhead in the rest of the query. Below is a summary of the analysis (AI generated). This makes me think we need to tackle the problem from the DataFusion side as well, with the first step being **adding a benchmark for wide-schema queries**.

## TL;DR

A simple selective query over 339 Parquet files runs **~11× slower** when the files carry hundreds of columns the query never touches, even though every unused column is correctly projected away and every unused row group is correctly pruned. `metadata_load_time` is the cleanest signal but accounts for only ~10% of the wall-clock overhead. The rest sits in downstream per-file setup that scales with schema width, most of it not explicitly bucketed by `EXPLAIN ANALYZE`.

## Setup

- Two copies of the same 339 Parquet files, same rows, same row groups:
  - **wide**: full production schema (681 unique schema paths, 163,926 column chunks across the dataset).
  - **narrow**: only the 4 columns the query references (4 schema paths, 1,356 column chunks).
- Engine: `datafusion-cli 53.0.0`, local disk, hot OS cache, 12-way scan parallelism.
- Query:

```sql
SELECT start_timestamp, trace_id
FROM '<path>/*.parquet'
WHERE service_name = 'api'
  AND span_name = 'Report usage event';
```

Returns 1,287 rows out of ~12.1 M scanned. Pruning behaviour is identical on both datasets: 350→350 files matched on stats, 339→339 row groups matched on stats, 339→320 row groups after bloom-filter pruning.
## Wall time (3 hot runs each)

| Dataset | run 1 | run 2 | run 3 | median |
|---|---|---|---|---|
| narrow | 0.053 s | 0.046 s | 0.049 s | **0.049 s** |
| wide | 0.548 s | 0.556 s | 0.557 s | **0.557 s** |

## EXPLAIN ANALYZE: DataSourceExec metrics

Sums across 12 scan tasks. Same `output_bytes=1266.2 MB`, same final row count, identical pruning.

| Phase | narrow | wide | Δ |
|---|---|---|---|
| **metadata_load_time** | 8.4 ms | 580 ms | **+572 ms** |
| time_elapsed_opening | 27 ms | 161 ms | +134 ms |
| time_elapsed_scanning_until_data | 308 ms | 526 ms | +218 ms |
| time_elapsed_processing | 307 ms | 737 ms | +430 ms |
| time_elapsed_scanning_total | 391 ms | 755 ms | +364 ms |
| bloom_filter_eval_time | 349 ms | 284 ms | -65 ms |
| statistics_eval_time | 4.6 ms | 7.0 ms | +2.4 ms |
| page_index_eval_time | 13 µs | 14 µs | ~0 |
| FilterExec elapsed_compute | 17.7 ms | 17.3 ms | ~0 |

## Where the cost lands

The wall-clock delta is **+508 ms** going narrow → wide. With ~12-way parallelism, dividing the cumulative metrics by 12 gives a rough wall-clock estimate per phase:

| Phase | Cumulative Δ | ~Wall Δ |
|---|---|---|
| metadata_load_time | +572 ms | ~48 ms |
| time_elapsed_opening | +134 ms | ~11 ms |
| time_elapsed_processing (excl. metadata + opening) | ~+296 ms | ~25 ms |
| Unattributed (serial setup, expression binding against wide schema, scan-task dispatch) | — | **~420 ms** |

## Observations

1. **`metadata_load_time` scales roughly linearly with column-chunk count.** A few microseconds per chunk in both datasets (8.4 ms / 1,356 ≈ 6 µs; 580 ms / 163,926 ≈ 3.5 µs). The per-chunk cost is layout-independent; what changes is how many chunks the footer describes.
2. **The wide schema keeps costing after metadata is loaded.** `time_elapsed_processing` more than doubles and `time_elapsed_scanning_until_data` grows by 218 ms even though the projected columns are identical.
   That is per-column reader-state construction, schema-descriptor traversal, and predicate binding against a 681-element schema, all running per file before the first useful byte flows.
3. **Predicate-evaluation phases are flat.** `bloom_filter_eval_time`, `statistics_eval_time`, `page_index_eval_time`, `FilterExec.elapsed_compute`: none of them respond to schema width. Once row groups are selected, the unused columns do not participate.
4. **Most of the wall-clock gap is unattributed by `EXPLAIN ANALYZE`.** The bucketed phases sum to ~85 ms of wall time; the observed delta is 508 ms. The remaining ~420 ms is serial coordinator/planner work that scales with schema width and is not broken out in current metrics. On object storage this is likely amplified further, because every file open also pays a network round trip proportional to footer size.

## Conclusion

The narrow result (~49 ms) is roughly what this query *should* cost. Anything above that is the price of carrying schema descriptions for columns the query never touches through the planner and scanner.
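The attribution arithmetic behind the table and observations above can be reproduced directly from the reported metrics. This is only a back-of-the-envelope check; the division by 12 is the same rough approximation used in the "Where the cost lands" table.

```python
# Reproduce the rough wall-clock attribution from the cumulative
# EXPLAIN ANALYZE deltas reported above. All values in milliseconds.
PARALLELISM = 12

wall_delta = 557 - 49  # median wide minus median narrow = 508 ms

deltas = {
    "metadata_load_time": 572,
    "time_elapsed_opening": 134,
    # processing delta minus the opening delta, matching the ~+296 ms
    # "excl. metadata + opening" row in the attribution table
    "processing_excl_metadata_opening": 430 - 134,
}

# Dividing cumulative task time by the scan parallelism gives a rough
# per-phase wall-clock estimate.
wall_estimates = {k: v / PARALLELISM for k, v in deltas.items()}
bucketed = sum(wall_estimates.values())   # roughly 85 ms
unattributed = wall_delta - bucketed      # roughly 420 ms

# Per-chunk metadata cost: footer decode time / number of column chunks.
narrow_per_chunk_us = 8.4 / 1_356 * 1_000     # a few µs per chunk
wide_per_chunk_us = 580 / 163_926 * 1_000     # a few µs per chunk

print(bucketed, unattributed)
print(narrow_per_chunk_us, wide_per_chunk_us)
```

The unattributed remainder dominating the bucketed phases is what motivates both the benchmark and better metric coverage for the serial per-file setup path.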
