mbutrovich commented on issue #21543: URL: https://github.com/apache/datafusion/issues/21543#issuecomment-4226750227
Leaving some notes to myself for next week, and for anyone else to chime in. One idea I came up with:

### Chunked sort with incremental accumulation

1. Buffer `RecordBatch`es as they arrive (same as today).
2. When ~32K rows accumulate, concat the chunk, sort it, materialize via `take()`, and stash it as a sorted run.
3. On memory pressure, spill sorted runs directly — they're already sorted, so no re-sorting is needed.
4. On input exhaustion, k-way merge all runs using the existing loser tree / multi-level merge.

For radix-eligible types: encode the chunk to `Rows`, radix sort, drop the `Rows`. The encoding is transient.

For dictionaries/nested types: `lexsort_to_indices` on the concatenated chunk, same as the current sub-1MB path.

This design is similar to the new immediate-mode shuffle writer in DataFusion Comet: https://github.com/apache/datafusion-comet/pull/3845

### Why this helps

- **Radix sort**: 2–3x faster than `lexsort_to_indices` at 32K rows (apache/arrow-rs#9683), but it needs large batches to amortize the `RowConverter` encoding cost. This design gives it that.
- **Any sort kernel**: Fewer sorted runs mean lower merge fan-in. 100K rows goes from ~100 runs of 1K to ~3 runs of 32K.
- **Memory**: Peak during a chunk sort is `raw_chunk + Rows + sorted_output` for one 32K-row chunk — bounded and predictable.

The 1MB threshold question mostly goes away: the sort always operates on fixed-size chunks, and memory pressure is handled by spilling sorted runs.

For context, [DuckDB's sort redesign](https://duckdb.org/2025/09/24/sorting-again) takes this further — encoding into normalized keys as data arrives and accumulating large thread-local runs. This proposal is a step in that direction while staying within DataFusion's existing architecture.

### Degenerate cases

I don't think there are new ones beyond what already exists. Fan-in is strictly better (fewer, larger runs). Skew doesn't affect radix sort (O(n × key_width)). Small datasets (< 32K rows) reduce to a single chunk — same as the current sub-1MB path.
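To make the accumulation loop concrete, here's a minimal std-only sketch of steps 1–4. All names (`ChunkedSorter`, `push_batch`, `flush_chunk`) are hypothetical, and plain `i64` keys stand in for `RecordBatch`es; the real implementation would sort Arrow data via the radix or `lexsort_to_indices` kernels:

```rust
/// Hypothetical sketch — plain i64 keys stand in for RecordBatch rows.
const CHUNK_ROWS: usize = 32 * 1024;

struct ChunkedSorter {
    buffer: Vec<i64>,           // rows buffered since the last chunk sort
    sorted_runs: Vec<Vec<i64>>, // each run is sorted, ready to spill or merge
}

impl ChunkedSorter {
    fn new() -> Self {
        Self { buffer: Vec::new(), sorted_runs: Vec::new() }
    }

    /// Steps 1–2: buffer an incoming batch; once ~CHUNK_ROWS accumulate,
    /// sort the chunk and stash it as a sorted run.
    fn push_batch(&mut self, batch: &[i64]) {
        self.buffer.extend_from_slice(batch);
        if self.buffer.len() >= CHUNK_ROWS {
            self.flush_chunk();
        }
    }

    /// Stands in for concat + radix sort on `Rows` (or
    /// `lexsort_to_indices` + `take()` for non-eligible types).
    fn flush_chunk(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let mut chunk = std::mem::take(&mut self.buffer);
        chunk.sort_unstable(); // radix sort in the real design
        self.sorted_runs.push(chunk);
    }

    /// Step 3: on memory pressure a run can be handed to the spill writer
    /// as-is — it is already sorted, so no re-sorting before writing.
    fn spill_one_run(&mut self) -> Option<Vec<i64>> {
        self.sorted_runs.pop()
    }

    /// Step 4: input exhausted — flush the tail chunk and hand all runs
    /// to the k-way merge.
    fn finish(mut self) -> Vec<Vec<i64>> {
        self.flush_chunk();
        self.sorted_runs
    }
}
```

Note the memory bound falls out of `flush_chunk`: at most one unsorted chunk plus its sorted output is live at a time, matching the `raw_chunk + Rows + sorted_output` peak above.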
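And a sketch of the final merge step. DataFusion's actual merge uses a loser tree over streaming cursors; a `BinaryHeap` of `(head value, run index, position)` over in-memory runs shows the same idea, with fan-in equal to the number of runs (the `merge_runs` name is made up for illustration):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of sorted runs via a min-heap. The real loser-tree merge
/// streams batches from spilled runs instead of holding Vec<i64>s.
fn merge_runs(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = runs.iter().map(|r| r.len()).sum();
    let mut out = Vec::with_capacity(total);

    // Seed the heap with the head of each non-empty run.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, run_idx, 0usize)));
        }
    }

    // Repeatedly pop the smallest head and advance that run's cursor.
    while let Some(Reverse((v, run_idx, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

With 32K-row chunks, 100K rows yields only a handful of heap entries here, versus ~100 under 1K-row runs — that's the fan-in reduction the "Why this helps" bullet refers to.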
The existing multi-level merge handles large fan-in when many chunks are needed.
