Dandandan opened a new pull request, #21830:
URL: https://github.com/apache/datafusion/pull/21830

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   Hash-based repartitioning does `hash % num_partitions` for every row in 
every batch. On modern x86 a 64-bit hardware `div` costs roughly 20–80 cycles, 
depending on the microarchitecture, and is poorly pipelined, so the divider 
becomes a bottleneck in the per-row loop in `BatchPartitioner`. Because 
`num_partitions` is a runtime value, the compiler can't strength-reduce the 
division to a multiply as it does for compile-time constants.
   
   Lemire's fastrange `((hash as u128) * n) >> 64` maps a 64-bit hash into 
`0..n` about as evenly as `%` does, using one 64×64→128 multiply plus a shift 
(~4–6 cycles, fully pipelined). Unlike `%`, which depends on the low bits of 
the hash, the result depends on the high bits. It is a well-known way to avoid 
`%` when reducing a hash to a bucket index.
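
A minimal sketch of the reduction described above, next to the modulo it replaces. The function names and the `u64` partition count are illustrative choices, not the signatures used in `BatchPartitioner`.

```rust
/// Illustrative helper (hypothetical name): map a 64-bit hash into `0..n`.
///
/// The 64x64 -> 128-bit product is at most (2^64 - 1) * n, so after shifting
/// right by 64 the result is always strictly less than `n` (and 0 when n == 0).
pub fn fastrange(hash: u64, n: u64) -> u64 {
    (((hash as u128) * (n as u128)) >> 64) as u64
}

/// The modulo reduction, shown for comparison. Panics if `n == 0`.
pub fn modulo(hash: u64, n: u64) -> u64 {
    hash % n
}
```

With `n = 8`, the hash `0xDEAD_BEEF_0000_0003` maps to 6 under `fastrange` (its top three bits) and to 3 under `modulo` (its bottom three bits), which shows why the two schemes assign rows to different partitions.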
   
   The output is not the same partition number as `hash % n` for a given row, 
but the uniformity is equivalent for well-distributed hashes, which is all the 
partitioner cares about.
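
The "well-distributed hashes" condition can be illustrated with a small experiment. The sketch below is not part of the PR: it uses the SplitMix64 finalizer as a stand-in for a well-mixed hash function, and `bucket_counts` and `compare` are hypothetical helpers written for this example.

```rust
/// SplitMix64 finalizer, used here only as a stand-in for a well-mixed hash.
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Multiply-shift reduction of a 64-bit hash into `0..n`.
fn fastrange(hash: u64, n: u64) -> u64 {
    (((hash as u128) * (n as u128)) >> 64) as u64
}

/// Modulo reduction, for comparison.
fn modulo(hash: u64, n: u64) -> u64 {
    hash % n
}

/// Count how many of `hashes` fall into each of `n` buckets under `reduce`.
fn bucket_counts(hashes: &[u64], n: usize, reduce: fn(u64, u64) -> u64) -> Vec<usize> {
    let mut counts = vec![0usize; n];
    for &h in hashes {
        counts[reduce(h, n as u64) as usize] += 1;
    }
    counts
}

/// Returns bucket counts for (mixed + fastrange, mixed + modulo, raw + fastrange),
/// where "raw" means the sequential integers 0..rows used directly as hashes.
fn compare(n: usize, rows: u64) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let mixed: Vec<u64> = (0..rows).map(splitmix64).collect();
    let raw: Vec<u64> = (0..rows).collect();
    (
        bucket_counts(&mixed, n, fastrange),
        bucket_counts(&mixed, n, modulo),
        bucket_counts(&raw, n, fastrange),
    )
}
```

With mixed hashes, both reductions should fill the buckets close to evenly. With small sequential integers used directly as hashes, fastrange sends every row to bucket 0, because only the high bits of the hash influence the result. This is why the equivalence holds only for well-distributed hashes.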
   
   ## What changes are included in this PR?
   
   - `datafusion/physical-plan/src/repartition/mod.rs`: replace `hash % 
partitions` with fastrange in the hash-partitioning inner loop of 
`BatchPartitioner`.
   
   The round-robin path at the same site still uses `%` on a counter and is out 
of scope here.
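
To make the shape of the change concrete, here is a simplified sketch of a hash-partitioning inner loop. It is not the code in `repartition/mod.rs`: `partition_indices` is a hypothetical helper that assumes the row hashes have already been computed into a slice.

```rust
/// Hypothetical helper: group row indices by output partition.
///
/// `hashes[i]` is assumed to be the precomputed 64-bit hash of row `i`.
fn partition_indices(hashes: &[u64], num_partitions: usize) -> Vec<Vec<u32>> {
    assert!(num_partitions > 0, "need at least one output partition");
    // Widen once outside the loop; the loop body is a multiply and a shift.
    let n = num_partitions as u128;
    let mut indices: Vec<Vec<u32>> = vec![Vec::new(); num_partitions];
    for (row, &hash) in hashes.iter().enumerate() {
        // Previously: `(hash % num_partitions as u64) as usize`.
        let partition = (((hash as u128) * n) >> 64) as usize;
        indices[partition].push(row as u32);
    }
    indices
}
```

Each row index lands in exactly one list, and rows with equal hashes always share a partition, which is the property hash partitioning relies on.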
   
   ## Are these changes tested?
   
   Covered by the existing repartition tests (`cargo test -p 
datafusion-physical-plan repartition` — 41 tests pass locally). No test pins 
the specific hash→partition mapping; they assert on counts/ordering invariants 
that fastrange preserves.
   
   ## Are there any user-facing changes?
   
   No public API changes. The one observable difference is that a given row may 
land on a different output partition than before for `Hash` partitioning. The 
distribution is still uniform, so downstream operators behave the same, but 
anything that records exactly which rows land in which partition will see 
different results.



