SubhamSinghal opened a new pull request, #21775:
URL: https://github.com/apache/datafusion/pull/21775

     ## Which issue does this PR close?                                         
                                                                                
                       
                                                                                
                                                                                
                       
     - Closes #.                                                                
                                                                                
                       
                                                                                
                                                                                
                       
     ## Rationale for this change                                               
                                                                                
                       
   For RightSemi/RightAnti hash joins, the probe phase only needs to know 
whether **at least one** matching build row exists per probe row. However, the 
current code:               
      
     1. Traverses the **entire** collision chain, emitting every `(probe, 
build)` pair                                                                    
                             
     2. Runs `equal_rows_arr` on **all** pairs (`take()` + `eq_dyn_null()`, 
allocating intermediate arrays)
     3. Deduplicates via `get_semi_indices` to keep just the first match per 
probe row                                                                       
                          
    
   For skewed data with N duplicate build rows per key, this produces N pairs 
per probe row only to keep 1 — O(N) wasted work in both chain traversal and 
equality verification.     
                                                                                
                                                                                
                       
     ## What changes are included in this PR?                                   
                                                                                
                       
                                                               
     **New `get_first_match` method on `JoinHashMapType` trait** 
(`join_hash_map.rs`):                                                           
                                      
     - Walks the collision chain for a hash value, stops at the first entry 
where a caller-provided predicate returns true
     - Returns `Option<u64>` — the matching build row index, or `None`          
                                                                                
                       
     - Shared implementation in `get_first_match_impl`, reused by 
`JoinHashMapU32`, `JoinHashMapU64`, and `PruningJoinHashMap`                    
                                     
                                                                                
                                                                                
                       
     **New `process_probe_batch_right_semi_anti` function** (`stream.rs`):      
                                                                                
                       
     - For each probe row, calls `get_first_match` with an equality predicate 
built from `JoinKeyComparator::is_equal` (zero-allocation per-row comparison)   
                         
     - Collects matched probe indices (RightSemi) or unmatched probe indices 
(RightAnti)                                                                     
                          
     - Builds output via existing `build_batch_from_indices`                    
                                                                                
                       
                                                                                
                                                                                
                                                                                
                                                                           
      
     ## Are these changes tested?                                               
                                                                                
                       
                                                               
     Covered by existing tests:                                                 
                                                                                
                       
     - 791 hash join unit tests (all pass)                     
     - 14 SLT join test files (all pass)                                        
                                                                                
                       
     - TPC-H SF1 benchmark: no regressions, Q4/Q21 (semi/anti patterns) show 
minor improvements                                                              
                          
                                                                                
                                                                                
                       
     No new tests added — this is a pure optimization that produces identical 
results. The existing test suite comprehensively covers RightSemi/RightAnti 
join.                                                                           
                                       
                                                                                
                                                                                
                       
     ## Are there any user-facing changes?                                      
                                                                                
                       
      
     No.
   
   ## TPC-H  Benchmark result:
   
   | Query | main (ms) | new (ms) | Change |
     |-------|-----------|----------|--------|
     | Q1 | 2.96 | 2.89 | ~same |                                               
                                                                                
                       
     | Q2 | 3.12 | 2.80 | ~same |                                               
                                                                                
                       
     | Q3 | 1.70 | 1.77 | ~same |                                               
                                                                                
                       
     | Q4 | 1.23 | 1.15 | -7% |                                                 
                                                                                
                       
     | Q5 | 2.11 | 2.04 | ~same |                                               
                                                                                
                       
     | Q6 | 0.69 | 0.75 | ~same |                                               
                                                                                
                       
     | Q7 | 2.58 | 2.51 | ~same |                                               
                                                                                
                       
     | Q8 | 2.94 | 2.91 | ~same |                                               
                                                                                
                       
     | Q9 | 2.19 | 2.03 | ~same |                                               
                                                                                
                       
     | Q10 | 2.20 | 2.15 | ~same |                                              
                                                                                
                       
     | Q11 | 1.95 | 1.91 | ~same |                                              
                                                                                
                       
     | Q12 | 1.38 | 1.36 | ~same |                                              
                                                                                
                       
     | Q13 | 1.23 | 1.10 | -11% |                                               
                                                                                
                       
     | Q14 | 1.18 | 1.12 | ~same |                                              
                                                                                
                       
     | Q15 | 2.28 | 2.22 | ~same |                                              
                                                                                
                       
     | Q16 | 1.76 | 1.70 | ~same |                                              
                                                                                
                       
     | Q17 | 1.37 | 1.26 | ~same |                                              
                                                                                
                       
     | Q18 | 1.61 | 1.45 | -10% |                                               
                                                                                
                       
     | Q19 | 2.28 | 2.05 | -10% |                                               
                                                                                
                       
     | Q20 | 2.01 | 1.81 | -10% |                                               
                                                                                
                       
     | Q21 | 2.72 | 2.23 | **-18%** |                                           
                                                                                
                       
     | Q22 | 1.89 | 1.75 | ~same |                


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to