Dandandan opened a new pull request, #21336: URL: https://github.com/apache/datafusion/pull/21336
## Which issue does this PR close? N/A — performance optimization. ## Rationale for this change For `ORDER BY <single_primitive_column> LIMIT K` queries, the current TopK operator converts every incoming row through Arrow's `RowConverter` into heap-allocated `Vec<u8>` byte slices, then compares those slices. This has three sources of overhead: 1. **RowConverter batch encoding** — converts arrays to row format even though only a scalar comparison is needed 2. **Heap allocation per row** — each `TopKRow` stores a `Vec<u8>` for the sort key 3. **Byte-slice comparison** — variable-length memcmp instead of a single native comparison ## What changes are included in this PR? Adds a specialized native TopK path (`TopKInner::Native`) for single-column sorts on primitive types (integers, floats, dates, timestamps, durations): - **`NativeTopKRow`**: stores the sort key as an inline `u128` (no heap allocation) with order-preserving encoding that handles ASC/DESC and NULLS FIRST/LAST - **`NativeTopKHeap`**: `BinaryHeap<NativeTopKRow>` with the same compaction/emit logic as the existing `TopKHeap` - **`find_new_native_topk_items`**: type-dispatched inner loop that downcasts to `PrimitiveArray<T>` once per batch, then encodes and compares native values directly - **`TopK`** now holds a `TopKInner` enum (`Row` | `Native`), automatically selecting the native path in `try_new` when applicable The existing row-based path is unchanged for multi-column sorts and non-primitive types. ## Are these changes tested? - All existing TopK unit tests pass (including `test_topk_marks_filter_complete` which exercises the native path via single Int32 column) - All TopK SQL logic tests pass - New unit tests for encoding correctness: signed/unsigned integer ordering, float ordering (including NaN/infinity/negative zero), and all four combinations of ASC/DESC × NULLS FIRST/LAST ## Are there any user-facing changes? No — this is a transparent performance optimization. Query results are unchanged. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
