damahua opened a new pull request, #21325:
URL: https://github.com/apache/datafusion/pull/21325
## Which issue does this PR close?
Related to #19444 (sort spill StringView gc) and #20500 (repartition
StringView gc). This PR extends the same fix to hash aggregation and sort-merge
join spill paths, and adds BinaryViewArray support to the sort operator.
## Rationale
After operations like `take` or `slice`, `StringViewArray` and
`BinaryViewArray` retain shared references to **all** original data buffers.
When these batches are written to spill files individually, the IPC writer must
include every referenced buffer for every batch, causing massive write
amplification.
The sort operator already had a fix for this (`organize_stringview_arrays`
in `sort.rs`), but the **hash aggregation** and **sort-merge join** spill paths
were missing it. Additionally, the sort operator's fix only handled
`StringViewArray`, not `BinaryViewArray`.
### Hash aggregation spill path
In `row_hash.rs`, `IncrementalSortIterator` produces output chunks via
`take_record_batch`. Each chunk shares the same StringView data buffers as the
parent emitted batch. Without `gc()`, spilling N chunks writes N copies of all
shared buffers.
### Sort-merge join spill path
In `bitwise_stream.rs`, `inner_key_buffer` contains sliced batches that
share StringView data buffers with the original unsliced batches.
## Changes
1. **New `gc_view_arrays()` utility** in `spill/mod.rs` — compacts both
`StringViewArray` and `BinaryViewArray` in a `RecordBatch`, returning the batch
unchanged (no allocation) when no view-type columns exist
2. **Hash aggregation spill** (`row_hash.rs`) — gc each
`IncrementalSortIterator` output batch before writing to the spill file
3. **Sort-merge join spill** (`bitwise_stream.rs`) — gc sliced
`inner_key_buffer` batches before spilling
4. **Sort operator** (`sort.rs`) — extended existing
`organize_stringview_arrays` to also handle `BinaryViewArray`
## A/B Benchmark Results
**Workload:** `SELECT group_key, COUNT(*), SUM(value) FROM t GROUP BY
group_key`
- 100,000 rows, 50,000 unique groups (high cardinality → forces spilling)
- `group_key`: `Utf8View` (StringViewArray) with 50+ byte non-inline strings
- Memory pool: 20 MB (FairSpillPool), single partition, batch_size=8192
- Same source commit, only gc patch differs. N=3 runs, deterministic data (0
variance).
| Metric | Baseline (no gc) | With gc | Change |
|--------|-----------------|---------|--------|
| **Spilled bytes** | **39.50 MB** | **7.90 MB** | **-80.0%** |
| Output bytes | 103.3 MB | 35.5 MB | -65.6% |
| Query time | 321 ± 11 ms | 324 ± 9 ms | ~same |
| Peak memory | 19.82 MB | 19.82 MB | same |
| Spill count | 2 | 2 | same |
With a tighter 8 MB pool: **baseline OOMs** (`ResourcesExhausted: Failed to
allocate additional 10.4 MB for GroupedHashAggregateStream`) because inflated
StringView buffers cause the sort memory estimate to exceed the pool.
**Optimized completes successfully** (5 spills, 20.9 MB spilled).
## Tests
- All 1,299 existing `datafusion-physical-plan` lib tests pass (1
pre-existing zstd feature failure)
- All 29 `memory_limit` integration tests pass
- All 56 `sort_merge_join` tests pass
- Added 4 new tests:
- `test_gc_view_arrays_reduces_spill_size` — verifies gc compacts taken
StringView/BinaryView batches
- `test_gc_view_arrays_write_amplification` — demonstrates 8.5× write
amplification without gc
- `test_gc_view_arrays_noop_for_non_view_types` — verifies no overhead for
non-view types
- `bench_stringview_aggregate_spill` — end-to-end benchmark with EXPLAIN
ANALYZE metrics
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]