andygrove opened a new pull request, #1603:
URL: https://github.com/apache/datafusion-ballista/pull/1603

   # Which issue does this PR close?
   
   N/A — small diagnostic / observability change.
   
   # Rationale for this change
   
   The standalone shuffle benchmark in #1600 reports ~98% of sort-shuffle wall 
time inside the existing `write_time` metric (e.g. `write_time: 44.268s` of a 
45s run, with `repart_time: 0.006s` and `spill_time: 0.000s`). That timer wraps 
the entire `sort_shuffle::finalize_output` block — Arrow IPC encoding, 
dictionary tracking, LZ4/Zstd compression, and `BufWriter::write_all` — fused 
into a single number. There is no seam where a timer can sit between "encode" 
and "disk", so it is impossible to tell whether `finalize_output` is CPU-bound 
or I/O-bound from the metrics alone.
   
   This becomes a blocker for evaluating optimizations like Comet's 
`BufBatchWriter` IPC-block coalescing. Without splitting encode from disk, you 
can't measure whether such a port helps or just shifts cost between phases.
   
   # What changes are included in this PR?
   
   - **New module** `ballista/core/src/execution_plans/timed_write.rs`: a 
`TimedWrite<W: Write>` adapter that records the elapsed time of each `write` / 
`flush` call into a `metrics::Time`. Two unit tests cover byte-through-pass 
behaviour and non-zero accumulation.
   - **`sort_shuffle/writer.rs`**: adds a `disk_write_time` metric to 
`SortShuffleWriteMetrics` and threads it into `finalize_output`. The 
consolidated output `BufWriter<File>` is wrapped in `TimedWrite` so the 
inner-byte-pump cost is recorded separately. The outer `write_time` timer is 
unchanged, so `write_time - disk_write_time ≈ encode + compress` cost.
   - **New test** `test_sort_shuffle_writer_records_disk_write_time` runs the 
writer end-to-end and asserts the new metric is reported with a non-zero value.
   
   The hash-path `ShuffleWriterExec` and the sort-shuffle `SpillManager` are 
intentionally untouched — neither is on the hot path the bench identified, and 
keeping the change scoped makes it easier to review and revert if the 
diagnostic reveals the gap is elsewhere.
   
   # Are there any user-facing changes?
   
   A new metric `disk_write_time` is reported by `SortShuffleWriterExec`. 
Existing metrics (`write_time`, `repart_time`, `spill_time`, etc.) are 
unchanged in name and semantics. No public API or wire-format changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to