This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 233dad39b6 Optimize partition_validity function used in sort kernels (#7937)
233dad39b6 is described below

commit 233dad39b65b9eba9203450fca150094db9c7fcc
Author: Jörn Horstmann <[email protected]>
AuthorDate: Fri Jul 18 13:55:56 2025 +0200

    Optimize partition_validity function used in sort kernels (#7937)
    
    # Which issue does this PR close?
    
    Optimize `partition_validity` function used in sort kernels
    
    - Preallocate vectors based on known null counts
    - Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of
    `Array::is_valid`
    - Avoid capacity checks inside the loop by writing to `spare_capacity_mut`
    instead of using `push` (see the sketch after this list)
    - Closes #7936.
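    
    As a rough, standalone sketch of the preallocate + `spare_capacity_mut`
    pattern described above (illustrative names only, not the actual kernel
    code, which iterates the array's `NullBuffer` rather than a plain slice):
    
    ```rust
    /// Partition the indices of `flags` into (set, unset), preallocating both
    /// outputs and writing into their spare capacity instead of `push`ing.
    fn partition_by_flag(flags: &[bool]) -> (Vec<u32>, Vec<u32>) {
        let set_count = flags.iter().filter(|v| **v).count();
        let mut set_indices = Vec::with_capacity(set_count);
        let mut unset_indices = Vec::with_capacity(flags.len() - set_count);
    
        // `spare_capacity_mut` exposes the uninitialized tail as
        // `&mut [MaybeUninit<u32>]`, so the loop avoids the capacity check
        // that `push` performs on every call.
        let set_slice = set_indices.spare_capacity_mut();
        let unset_slice = unset_indices.spare_capacity_mut();
        let (mut set_idx, mut unset_idx) = (0, 0);
        for (i, v) in flags.iter().enumerate() {
            if *v {
                set_slice[set_idx].write(i as u32);
                set_idx += 1;
            } else {
                unset_slice[unset_idx].write(i as u32);
                unset_idx += 1;
            }
        }
    
        // Safety: exactly `set_idx` / `unset_idx` elements were initialized
        // above, and both are bounded by the capacities reserved up front.
        unsafe {
            set_indices.set_len(set_idx);
            unset_indices.set_len(unset_idx);
        }
        (set_indices, unset_indices)
    }
    ```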
    
    # Rationale for this change
    
    Microbenchmark results for `sort_kernels` compared to `main`, only
    looking at benchmarks matching "nulls to indices":
    
    ```
    sort i32 nulls to indices 2^10
                            time:   [4.9325 µs 4.9370 µs 4.9422 µs]
                            change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort i32 nulls to indices 2^12
                            time:   [20.096 µs 20.209 µs 20.327 µs]
                            change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort f32 nulls to indices 2^12
                            time:   [26.329 µs 26.366 µs 26.406 µs]
                            change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[0-10] nulls to indices 2^12
                            time:   [70.667 µs 70.762 µs 70.886 µs]
                            change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[0-100] nulls to indices 2^12
                            time:   [101.98 µs 102.44 µs 102.99 µs]
                            change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
                            No change in performance detected.

    sort string[0-400] nulls to indices 2^12
                            time:   [84.952 µs 85.024 µs 85.102 µs]
                            change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[10] nulls to indices 2^12
                            time:   [72.486 µs 72.664 µs 72.893 µs]
                            change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[100] nulls to indices 2^12
                            time:   [71.354 µs 71.606 µs 71.902 µs]
                            change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[1000] nulls to indices 2^12
                            time:   [73.088 µs 73.195 µs 73.311 µs]
                            change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string_view[10] nulls to indices 2^12
                            time:   [32.592 µs 32.654 µs 32.731 µs]
                            change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string_view[0-400] nulls to indices 2^12
                            time:   [32.981 µs 33.074 µs 33.189 µs]
                            change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string_view_inlined[0-12] nulls to indices 2^12
                            time:   [28.467 µs 28.496 µs 28.529 µs]
                            change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
                            Performance has improved.

    sort string[10] dict nulls to indices 2^12
                            time:   [94.463 µs 94.503 µs 94.542 µs]
                            change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
                            Performance has improved.
    ```
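    
    (For reference, numbers like these can typically be gathered with a
    Criterion filter such as
    `cargo bench --bench sort_kernels -- "nulls to indices"`,
    assuming the bench target is named `sort_kernels` as above; the exact
    target name may differ.)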
    
    # Are these changes tested?
    
    Covered by existing tests
    
    # Are there any user-facing changes?
    
    No, the method is internal to the sort kernels.
---
 arrow-ord/src/sort.rs | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 3a2d372e04..be515c3f10 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -180,13 +180,41 @@ where
 
 // partition indices into valid and null indices
 fn partition_validity(array: &dyn Array) -> (Vec<u32>, Vec<u32>) {
-    match array.null_count() {
-        // faster path
-        0 => ((0..(array.len() as u32)).collect(), vec![]),
-        _ => {
-            let indices = 0..(array.len() as u32);
-            indices.partition(|index| array.is_valid(*index as usize))
+    let len = array.len();
+    let null_count = array.null_count();
+    match array.nulls() {
+        Some(nulls) if null_count > 0 => {
+            let mut valid_indices = Vec::with_capacity(len - null_count);
+            let mut null_indices = Vec::with_capacity(null_count);
+
+            let valid_slice = valid_indices.spare_capacity_mut();
+            let null_slice = null_indices.spare_capacity_mut();
+            let mut valid_idx = 0;
+            let mut null_idx = 0;
+
+            nulls.into_iter().enumerate().for_each(|(i, v)| {
+                if v {
+                    valid_slice[valid_idx].write(i as u32);
+                    valid_idx += 1;
+                } else {
+                    null_slice[null_idx].write(i as u32);
+                    null_idx += 1;
+                }
+            });
+
+            assert_eq!(null_idx, null_count);
+            assert_eq!(valid_idx, len - null_count);
+            // Safety: The new lengths match the initial capacity as asserted above,
+            // the bounds checks while writing also ensure they are less than or equal to the capacity.
+            unsafe {
+                valid_indices.set_len(valid_idx);
+                null_indices.set_len(null_idx);
+            }
+
+            (valid_indices, null_indices)
         }
+        // faster path
+        _ => ((0..(len as u32)).collect(), vec![]),
     }
 }
 
