[PR] feat: rewrite standalone shuffle_bench to drive real Parquet input [datafusion-ballista]

via GitHub Mon, 27 Apr 2026 15:03:49 -0700


andygrove opened a new pull request, #1600:
URL: https://github.com/apache/datafusion-ballista/pull/1600


   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
   
   The existing `benchmarks/src/bin/shuffle_bench` is a synthetic-data tool 
that already overlaps with the `sort_shuffle` Criterion bench at 
`benchmarks/benches/sort_shuffle.rs`. Both run the same hash-vs-sort A/B comparison 
against in-memory generated data.
   
   What's missing is a way to drive the shuffle writers end-to-end against real 
Parquet data — useful for TPC-H-scale profiling, flamegraph runs, and 
confirming that improvements measured on synthetic data hold on 
production-shaped inputs.
   
   This PR replaces `shuffle_bench` with a Parquet-driven runner modeled after [DataFusion Comet's `shuffle_bench`](https://github.com/apache/datafusion-comet/blob/main/native/shuffle/src/bin/shuffle_bench.rs).
   
   # What changes are included in this PR?
   
   - **`benchmarks/src/bin/shuffle_bench.rs`** — full rewrite. Streams input 
from a Parquet file or directory via `ctx.read_parquet`, adding a 
`CoalescePartitionsExec` when the Parquet plan emits multiple partitions. New 
flags:
     - `--input <path>` — Parquet file or directory.
     - `--writer hash|sort` — selects either `ShuffleWriterExec` or 
`SortShuffleWriterExec`.
     - `--partitioning hash` — partitioning scheme. `single` and `round-robin` 
are rejected at startup with a clear error and exit code 2 (neither writer 
supports them today).
     - `--hash-columns 0,3` — column indices to hash on.
     - `--partitions <N>` — output partition count (default 200).
     - `--batch-size <rows>` — DataFusion target batch size (default 8192).
     - `--memory-limit <bytes>` — applied to 
`RuntimeEnvBuilder::with_memory_limit` (governs Parquet decoding and other 
memory-aware operators) and to the sort writer's internal `memory_limit` field 
(governs spill-to-disk).
     - `--limit <rows>` — cap rows read from input.
     - `--iterations <N>` and `--warmup <N>` — timing loop control.
     - `--concurrent-tasks <N>` — spawn N tokio tasks each running their own 
shuffle write, simulating executor parallelism.
     - `--output-dir <path>` — work directory for shuffle data (cleaned up at 
exit).
   - **`benchmarks/Cargo.toml`** — adds `clap = { workspace = true }`. 
`structopt` is retained because `tpch.rs` and `nyctaxi.rs` still use it.
   - **`Cargo.lock`** — updated for the new `clap` dep on `ballista-benchmarks`.
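
The startup validation described for `--partitioning` and `--hash-columns` could look roughly like the sketch below. It is std-only and does not reproduce the PR's `clap` code: the `Partitioning` enum, `parse_partitioning`, and `parse_hash_columns` are hypothetical names, and using exit code 2 for a malformed column list is an assumption (the description only states it for unsupported partitioning schemes).

```rust
use std::process::ExitCode;

/// Partitioning schemes accepted by this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
enum Partitioning {
    Hash,
}

/// Parse the value of `--partitioning`, rejecting schemes the writers do not support.
fn parse_partitioning(value: &str) -> Result<Partitioning, String> {
    match value {
        "hash" => Ok(Partitioning::Hash),
        "single" | "round-robin" => Err(format!(
            "partitioning '{value}' is not supported by either shuffle writer"
        )),
        other => Err(format!("unknown partitioning '{other}'")),
    }
}

/// Parse a comma-separated list of column indices such as "0,3".
fn parse_hash_columns(value: &str) -> Result<Vec<usize>, String> {
    value
        .split(',')
        .map(|s| {
            s.trim()
                .parse::<usize>()
                .map_err(|e| format!("invalid column index '{s}': {e}"))
        })
        .collect()
}

fn main() -> ExitCode {
    // Hypothetical inputs standing in for values read from the command line.
    let partitioning = "hash";
    let hash_columns = "0,3";

    let scheme = match parse_partitioning(partitioning) {
        Ok(s) => s,
        Err(msg) => {
            eprintln!("error: {msg}");
            return ExitCode::from(2);
        }
    };
    let columns = match parse_hash_columns(hash_columns) {
        Ok(c) => c,
        Err(msg) => {
            // Assumption: a bad column list uses the same exit code.
            eprintln!("error: {msg}");
            return ExitCode::from(2);
        }
    };
    println!("partitioning={scheme:?} hash_columns={columns:?}");
    ExitCode::SUCCESS
}
```

Failing before any input is read keeps an unsupported configuration from costing a full Parquet scan.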
   
   The Criterion bench at `benchmarks/benches/sort_shuffle.rs` is left 
untouched; it continues to cover the synthetic-data A/B comparison.
   
   ## Sample run
   
   ```sh
   cargo run --release --bin shuffle_bench -- \
     --input /data/tpch/sf100/lineitem \
     --writer sort --partitions 200 --hash-columns 0,3 \
     --limit 1000000 --warmup 1 --iterations 3
   ```
   
   Output:
   ```
   === Ballista Shuffle Benchmark ===
   Writer:         Sort
   Partitioning:   Hash
   Input:          /data/tpch/sf100/lineitem
   Schema:         16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
   Total rows:     1000000
   Partitions:     200
   Batch size:     8192
   Iterations:     3 (warmup 1)
   
     [warmup 1/1] write: 46.756s
     [iter 1/3] write: 45.688s
     [iter 2/3] write: 45.057s
     [iter 3/3] write: 44.350s
   
   === Results ===
   avg time: 45.032s
   throughput: 22206 rows/s (total across 1 tasks)
   
   Shuffle metrics (last iteration):
     input_rows: 1000000
     write_time: 44.268s (98.3%)
     repart_time: 0.006s (0.0%)
     output_rows: 1000000
     spill_time: 0.000s (0.0%)
   ```
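
The summary figures above can be checked by hand from the per-iteration timings. The sketch below assumes that the average excludes the warmup run, that throughput is total rows across all tasks divided by the average time, and that the percentages are taken relative to the average time; none of this is confirmed by the PR text, and the function names are hypothetical.

```rust
/// Mean of the measured iteration times, in seconds (warmup excluded).
fn average_secs(iterations: &[f64]) -> f64 {
    iterations.iter().sum::<f64>() / iterations.len() as f64
}

/// Rows processed per second, summed across all concurrent tasks.
/// Assumption: every task processes `rows_per_task` rows.
fn throughput_rows_per_sec(rows_per_task: u64, tasks: u64, avg_secs: f64) -> f64 {
    (rows_per_task * tasks) as f64 / avg_secs
}

fn main() {
    // Timings copied from the sample output, already rounded to milliseconds.
    let iters = [45.688, 45.057, 44.350];
    let avg = average_secs(&iters);
    let throughput = throughput_rows_per_sec(1_000_000, 1, avg);
    // Share of the average time spent in write_time during the last iteration.
    let write_share = 44.268 / avg * 100.0;

    println!("avg time: {avg:.3}s");
    println!("throughput: {throughput:.0} rows/s");
    println!("write_time share: {write_share:.1}%");
}
```

Because the inputs are the rounded timings from the log, the recomputed values should match the reported ones only to within rounding of the last digit.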
   
   # Are there any user-facing changes?
   
   The CLI surface of `shuffle_bench` changes (different flags, switched from 
`structopt` to `clap`). This binary is a developer profiling tool, not a public 
API; anyone scripting against the previous synthetic flags (`--rows`, 
`--input-partitions`, `--hash-only`, `--sort-only`, etc.) will need to migrate 
to the new `--input` / `--writer` form.
   
   No library or wire-format changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

