mbutrovich commented on issue #21543: URL: https://github.com/apache/datafusion/issues/21543#issuecomment-4226750227
Leaving some notes to myself for next week, and for anyone else to chime in. One idea I came up with:

### Chunked sort with incremental accumulation

1. Buffer `RecordBatch`es as they arrive (same as today).
2. When ~32K rows accumulate, concat the chunk, sort it, materialize via `take()`, and stash it as a sorted run.
3. On memory pressure, spill sorted runs directly — they're already sorted, so no re-sorting is needed.
4. On input exhaustion, k-way merge all runs using the existing loser tree / multi-level merge.

For radix-eligible types: encode the chunk to `Rows`, radix sort, drop the `Rows`. The encoding is transient.

For dictionaries/nested types: `lexsort_to_indices` on the concatenated chunk, same as the current sub-1MB path.

This design is similar to the new immediate-mode shuffle writer in DataFusion Comet: https://github.com/apache/datafusion-comet/pull/3845

### Why this helps

- **Radix sort**: 2–3x faster than `lexsort_to_indices` at 32K rows (apache/arrow-rs#9683), but it needs large batches to amortize the `RowConverter` encoding cost. This design gives it that.
- **Any sort kernel**: Fewer sorted runs mean lower merge fan-in. 100K rows goes from ~100 runs of 1K to ~3 runs of 32K.
- **Memory**: Peak during a chunk sort is `raw_chunk + Rows + sorted_output` for one 32K-row chunk — bounded and predictable.

The 1MB threshold question mostly goes away: the sort always operates on fixed-size chunks, and memory pressure is handled by spilling sorted runs.

For context, [DuckDB's sort redesign](https://duckdb.org/2025/09/24/sorting-again) takes this further — encoding into normalized keys as data arrives and accumulating large thread-local runs. This proposal is a step in that direction while staying within DataFusion's existing architecture.

### Degenerate cases

I don't think there are new ones beyond what already exists. Fan-in is strictly better (fewer, larger runs). Skew doesn't affect radix sort (O(n × key_width)). Small datasets (< 32K rows) reduce to a single chunk — same as the current sub-1MB path.
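To make the accumulation loop concrete, here's a minimal std-only sketch of steps 1–4. All names (`ChunkedSorter`, `push_batch`, `flush_chunk`) are hypothetical, and plain `i64` keys stand in for `RecordBatch`es; the real implementation would sort Arrow data via the radix or `lexsort_to_indices` kernels:

```rust
/// Hypothetical sketch — plain i64 keys stand in for RecordBatch rows.
const CHUNK_ROWS: usize = 32 * 1024;

struct ChunkedSorter {
    buffer: Vec<i64>,           // rows buffered since the last chunk sort
    sorted_runs: Vec<Vec<i64>>, // each run is sorted, ready to spill or merge
}

impl ChunkedSorter {
    fn new() -> Self {
        Self { buffer: Vec::new(), sorted_runs: Vec::new() }
    }

    /// Steps 1–2: buffer an incoming batch; once ~CHUNK_ROWS accumulate,
    /// sort the chunk and stash it as a sorted run.
    fn push_batch(&mut self, batch: &[i64]) {
        self.buffer.extend_from_slice(batch);
        if self.buffer.len() >= CHUNK_ROWS {
            self.flush_chunk();
        }
    }

    /// Stands in for concat + radix sort on `Rows` (or
    /// `lexsort_to_indices` + `take()` for non-eligible types).
    fn flush_chunk(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let mut chunk = std::mem::take(&mut self.buffer);
        chunk.sort_unstable(); // radix sort in the real design
        self.sorted_runs.push(chunk);
    }

    /// Step 3: on memory pressure a run can be handed to the spill writer
    /// as-is — it is already sorted, so no re-sorting before writing.
    fn spill_one_run(&mut self) -> Option<Vec<i64>> {
        self.sorted_runs.pop()
    }

    /// Step 4: input exhausted — flush the tail chunk and hand all runs
    /// to the k-way merge.
    fn finish(mut self) -> Vec<Vec<i64>> {
        self.flush_chunk();
        self.sorted_runs
    }
}
```

Note the memory bound falls out of `flush_chunk`: at most one unsorted chunk plus its sorted output is live at a time, matching the `raw_chunk + Rows + sorted_output` peak above.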
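And a sketch of the final merge step. DataFusion's actual merge uses a loser tree over streaming cursors; a `BinaryHeap` of `(head value, run index, position)` over in-memory runs shows the same idea, with fan-in equal to the number of runs (the `merge_runs` name is made up for illustration):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of sorted runs via a min-heap. The real loser-tree merge
/// streams batches from spilled runs instead of holding Vec<i64>s.
fn merge_runs(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = runs.iter().map(|r| r.len()).sum();
    let mut out = Vec::with_capacity(total);

    // Seed the heap with the head of each non-empty run.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, run_idx, 0usize)));
        }
    }

    // Repeatedly pop the smallest head and advance that run's cursor.
    while let Some(Reverse((v, run_idx, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

With 32K-row chunks, 100K rows yields only a handful of heap entries here, versus ~100 under 1K-row runs — that's the fan-in reduction the "Why this helps" bullet refers to.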
The existing multi-level merge handles large fan-in when many chunks are needed.
