Dandandan opened a new pull request, #21830: URL: https://github.com/apache/datafusion/pull/21830
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Hash-based repartitioning computes `hash % num_partitions` for every row in every batch. On modern x86 a 64-bit hardware `div` takes ~20–80 cycles and, crucially, is not pipelined, so the divide unit becomes a bottleneck in the per-row loop in `BatchPartitioner`. Because `num_partitions` is a runtime value, the compiler can't strength-reduce the modulo to a multiply as it does for compile-time constants.

Lemire's fastrange, `((hash as u128) * (n as u128)) >> 64`, produces a uniform mapping from a 64-bit hash into `0..n` using one 64×64→128 multiply plus a shift (~4–6 cycles, fully pipelined). It is a known technique for avoiding `%` in bucket selection. The output is not the same partition number as `hash % n` for a given row, but the uniformity is equivalent for well-distributed hashes, which is all the partitioner cares about.

## What changes are included in this PR?

- `datafusion/physical-plan/src/repartition/mod.rs`: replace `hash % partitions` with fastrange in the hash-partitioning inner loop of `BatchPartitioner`. The round-robin path at the same site still uses `%` on a counter and is out of scope here.

## Are these changes tested?

Covered by the existing repartition tests (`cargo test -p datafusion-physical-plan repartition`; 41 tests pass locally). No test pins the specific hash→partition mapping; the tests assert on count and ordering invariants that fastrange preserves.

## Are there any user-facing changes?

No public API changes. The one observable difference is that a given row may land on a different output partition than before for `Hash` partitioning. The distribution is still uniform, so downstream operators behave the same, but anything that externally captures exact per-partition row identity will see rows move.
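
For reference, the fastrange reduction described in the rationale can be sketched as below. This is a minimal standalone illustration, not the code in `BatchPartitioner`: the names `fastrange` and `partition_counts` are hypothetical, and the input is assumed to be a slice of precomputed 64-bit row hashes.

```rust
/// Map a 64-bit hash into `0..n` using one widening multiply and a shift.
/// Hypothetical helper for illustration; it returns 0 when `n` is 0.
fn fastrange(hash: u64, n: u64) -> u64 {
    // The product fits in 128 bits; its high 64 bits are always < n.
    (((hash as u128) * (n as u128)) >> 64) as u64
}

/// Count how many hashes land in each of `n` partitions.
/// Hypothetical helper; assumes `n > 0`.
fn partition_counts(hashes: &[u64], n: usize) -> Vec<usize> {
    let mut counts = vec![0usize; n];
    for &hash in hashes {
        let partition = fastrange(hash, n as u64) as usize;
        counts[partition] += 1;
    }
    counts
}
```

Note that the mapping is driven by the high bits of the hash, whereas `hash % n` depends on the low bits, which is why individual rows can move to a different partition even though the overall distribution stays even for well-distributed hashes.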
