Dandandan opened a new pull request, #21344:
URL: https://github.com/apache/datafusion/pull/21344

   ## Which issue does this PR close?
   
   
   Related to #15961
   
   ## Rationale for this change
   
   Profiling `SELECT COUNT(DISTINCT "UserID") FROM hits` (ClickBench) showed that 
`GroupValuesPrimitive::intern` is a hot spot: it and 
`hashbrown::raw::RawTable::reserve_rehash` dominate the flamegraph.
   
   ## What changes are included in this PR?
   
   Two optimizations for the single-column primitive GROUP BY hot path:
   
   1. **Vectorized hashing**: Split `intern` into two phases: batch hash 
computation via `with_hashes` (a tight loop with better CPU pipelining), followed 
by hash table probing with the pre-computed hashes. The original code interleaved 
hash computation with hash table probing on every row, which prevented that 
pipelining.
   
   2. **Inline values in hash table**: Store the actual value in each hash 
table entry, `(usize, T::Native)`, instead of `(usize, u64)` (group index plus 
stored hash) with an indirect lookup into a separate `values` vec. This 
eliminates one cache miss per probe (no pointer chase from hash table entry → 
values array) and removes the need to store the hash: the value can be rehashed 
from the inline copy when needed, which is rare and happens only during table 
growth.
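
The two changes above can be illustrated with a minimal, std-only sketch. This is not the PR's code: `InlineGroups` and `hash_u64` are hypothetical names, `u64` stands in for `T::Native`, and a toy linear-probing table replaces hashbrown. It only aims to show the shape of the idea: hash the whole batch first, then probe slots that carry `(group_index, value)` inline and rehash from that inline value on growth.

```rust
/// Hypothetical stand-in for the real hash function: a cheap 64-bit mixer.
fn hash_u64(v: u64) -> u64 {
    let x = v.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^ (x >> 32)
}

/// Toy open-addressing table whose slots hold `(group_index, value)` inline.
struct InlineGroups {
    slots: Vec<Option<(usize, u64)>>, // length is always a power of two
    len: usize,                       // number of distinct groups seen so far
    hashes: Vec<u64>,                 // scratch buffer reused across batches
}

impl InlineGroups {
    fn new() -> Self {
        Self { slots: vec![None; 16], len: 0, hashes: Vec::new() }
    }

    /// Returns the group index assigned to each value in `batch`.
    fn intern(&mut self, batch: &[u64]) -> Vec<usize> {
        // Phase 1: hash the whole batch in one tight loop.
        self.hashes.clear();
        self.hashes.extend(batch.iter().map(|&v| hash_u64(v)));

        // Phase 2: probe the table using the pre-computed hashes.
        let mut groups = Vec::with_capacity(batch.len());
        for i in 0..batch.len() {
            // Keep the load factor at or below 1/2 so probing always terminates.
            if (self.len + 1) * 2 > self.slots.len() {
                self.grow();
            }
            let (value, hash) = (batch[i], self.hashes[i]);
            groups.push(self.find_or_insert(value, hash));
        }
        groups
    }

    fn find_or_insert(&mut self, value: u64, hash: u64) -> usize {
        let mask = self.slots.len() - 1;
        let mut pos = (hash as usize) & mask;
        loop {
            let slot = self.slots[pos];
            match slot {
                // The equality check reads the inline value; no second array is touched.
                Some((group, v)) if v == value => return group,
                Some(_) => pos = (pos + 1) & mask,
                None => {
                    let group = self.len;
                    self.slots[pos] = Some((group, value));
                    self.len += 1;
                    return group;
                }
            }
        }
    }

    /// Doubles the table. No hash is stored, so each entry is rehashed
    /// from its inline value.
    fn grow(&mut self) {
        let old = std::mem::take(&mut self.slots);
        self.slots = vec![None; old.len() * 2];
        let mask = self.slots.len() - 1;
        for (group, value) in old.into_iter().flatten() {
            let mut pos = (hash_u64(value) as usize) & mask;
            while self.slots[pos].is_some() {
                pos = (pos + 1) & mask;
            }
            self.slots[pos] = Some((group, value));
        }
    }
}
```

In this sketch, calling `intern(&[10, 20, 10, 30, 20])` on a fresh table assigns group indices in first-seen order, so repeated values map back to the index of their first occurrence. The trade-off mirrors the one described above: each slot carries the value rather than its hash, so growth pays for rehashing but every probe compares against data already in the slot.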
   
   ## Are these changes tested?
   
   Existing tests cover this code path.
   
   ## Are there any user-facing changes?
   
   No, this is a performance optimization only.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

