andygrove opened a new pull request, #1600:
URL: https://github.com/apache/datafusion-ballista/pull/1600
# Which issue does this PR close?

Closes #.

# Rationale for this change

The existing `benchmarks/src/bin/shuffle_bench` is a synthetic-data tool that already overlaps with the `sort_shuffle` Criterion bench at `benchmarks/benches/sort_shuffle.rs`: both run the same hash-vs-sort A/B comparison against in-memory generated data. What is missing is a way to drive the shuffle writers end-to-end against real Parquet data — useful for TPC-H-scale profiling, flamegraph runs, and confirming that improvements measured on synthetic data hold on production-shaped inputs. This PR replaces `shuffle_bench` with a Parquet-driven runner modeled after [DataFusion Comet's `shuffle_bench`](https://github.com/apache/datafusion-comet/blob/main/native/shuffle/src/bin/shuffle_bench.rs).

# What changes are included in this PR?

- **`benchmarks/src/bin/shuffle_bench.rs`** — full rewrite. Streams input from a Parquet file or directory via `ctx.read_parquet` (with an optional `CoalescePartitionsExec` if the Parquet plan emits multiple partitions). New flags:
  - `--input <path>` — Parquet file or directory.
  - `--writer hash|sort` — selects either `ShuffleWriterExec` or `SortShuffleWriterExec`.
  - `--partitioning hash` — partitioning scheme. `single` and `round-robin` are rejected at startup with a clear error and exit code 2 (neither writer supports them today).
  - `--hash-columns 0,3` — column indices to hash on.
  - `--partitions <N>` — output partition count (default 200).
  - `--batch-size <rows>` — DataFusion target batch size (default 8192).
  - `--memory-limit <bytes>` — applied to `RuntimeEnvBuilder::with_memory_limit` (governs Parquet decoding and other memory-aware operators) and to the sort writer's internal `memory_limit` field (governs spill-to-disk).
  - `--limit <rows>` — cap on the number of rows read from input.
  - `--iterations <N>` and `--warmup <N>` — timing-loop control.
  - `--concurrent-tasks <N>` — spawns N tokio tasks, each running its own shuffle write, to simulate executor parallelism.
  - `--output-dir <path>` — work directory for shuffle data (cleaned up at exit).
- **`benchmarks/Cargo.toml`** — adds `clap = { workspace = true }`. `structopt` is retained because `tpch.rs` and `nyctaxi.rs` still use it.
- **`Cargo.lock`** — updated for the new `clap` dependency of `ballista-benchmarks`.

The Criterion bench at `benchmarks/benches/sort_shuffle.rs` is left untouched; it continues to cover the synthetic-data A/B comparison.

## Sample run

```sh
cargo run --release --bin shuffle_bench -- \
  --input /data/tpch/sf100/lineitem \
  --writer sort --partitions 200 --hash-columns 0,3 \
  --limit 1000000 --warmup 1 --iterations 3
```

Output:

```
=== Ballista Shuffle Benchmark ===
Writer: Sort
Partitioning: Hash
Input: /data/tpch/sf100/lineitem
Schema: 16 cols (3xdate, 4xdecimal, 4xint, 5xstring)
Total rows: 1000000
Partitions: 200
Batch size: 8192
Iterations: 3 (warmup 1)
[warmup 1/1] write: 46.756s
[iter 1/3] write: 45.688s
[iter 2/3] write: 45.057s
[iter 3/3] write: 44.350s
=== Results ===
avg time: 45.032s
throughput: 22206 rows/s (total across 1 tasks)
Shuffle metrics (last iteration):
  input_rows: 1000000
  write_time: 44.268s (98.3%)
  repart_time: 0.006s (0.0%)
  output_rows: 1000000
  spill_time: 0.000s (0.0%)
```

# Are there any user-facing changes?

The CLI surface of `shuffle_bench` changes (different flags, switched from `structopt` to `clap`). This binary is a developer profiling tool, not a public API; anyone scripting against the previous synthetic flags (`--rows`, `--input-partitions`, `--hash-only`, `--sort-only`, etc.) will need to migrate to the new `--input` / `--writer` form. No library or wire-format changes.
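As a rough illustration of the startup validation described above (rejecting unsupported partitioning schemes with exit code 2, and parsing `--hash-columns 0,3` into column indices), here is a minimal stdlib-only sketch. The function names `parse_hash_columns` and `validate_partitioning` are hypothetical, not the PR's actual code:

```rust
use std::process::exit;

/// Parse a comma-separated list of column indices, e.g. "0,3" -> [0, 3].
fn parse_hash_columns(s: &str) -> Result<Vec<usize>, String> {
    s.split(',')
        .map(|part| {
            part.trim()
                .parse::<usize>()
                .map_err(|e| format!("invalid column index {part:?}: {e}"))
        })
        .collect()
}

/// Reject partitioning schemes that neither shuffle writer supports today.
fn validate_partitioning(scheme: &str) -> Result<(), String> {
    match scheme {
        "hash" => Ok(()),
        "single" | "round-robin" => Err(format!(
            "partitioning scheme {scheme:?} is not supported by either shuffle writer"
        )),
        other => Err(format!("unknown partitioning scheme {other:?}")),
    }
}

fn main() {
    // Fail fast at startup with exit code 2, as the PR describes.
    if let Err(msg) = validate_partitioning("hash") {
        eprintln!("error: {msg}");
        exit(2);
    }
    let cols = parse_hash_columns("0,3").expect("valid column list");
    println!("hashing on columns {cols:?}");
}
```

Failing at argument-validation time (rather than mid-run) keeps a multi-minute benchmark from wasting a TPC-H-scale read before discovering a bad flag.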
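The Results block above can be reproduced from the per-iteration timings: warmup runs are excluded, the timed iterations are averaged, and throughput is total rows across all concurrent tasks divided by the average time. A small sketch (the `summarize` helper is hypothetical, not the PR's code) using the sample run's numbers:

```rust
/// Average the timed iterations (warmup excluded) and derive aggregate
/// throughput across all concurrent tasks.
fn summarize(iter_secs: &[f64], rows_per_task: u64, tasks: u64) -> (f64, f64) {
    let avg = iter_secs.iter().sum::<f64>() / iter_secs.len() as f64;
    let throughput = (rows_per_task * tasks) as f64 / avg;
    (avg, throughput)
}

fn main() {
    // Timings from the sample run above; the 46.756s warmup is dropped.
    let (avg, tput) = summarize(&[45.688, 45.057, 44.350], 1_000_000, 1);
    println!("avg time: {avg:.3}s");                          // 45.032s
    println!("throughput: {tput:.0} rows/s (total across 1 tasks)"); // 22206
}
```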
