Re: RFC: Optimize unsigned division by constant when wider machine modes are available

Andrew Pinski via Gcc Tue, 28 Apr 2026 22:22:39 -0700

On Tue, Apr 28, 2026 at 6:28 PM Charles Schlosser via Gcc
<[email protected]> wrote:
>
> Greetings!  I am requesting feedback on whether the gcc community is
> interested in this contribution and any technical input on the proposed
> approach. Thank you!


First I want to say thank you for sending this off.
I noticed you sent this to gcc@ rather than gcc-patches@. Patches
normally goto gcc-patches@ for review.
Also please read https://gcc.gnu.org/contribute.html; especially the
legal part.

That being said, the code formatting is totally different from GCC's
normal formatting; it is hard to review.
Also it would be useful if this patch was split into two. The first
one extracts the original code into a new function
and then adds the new code on top of it.
You also removed a bunch of comments that were there when you
extracted into a new function.


Also with:
if (op0_c) {
+        t1 = expand_binop(compute_mode, smul_optab, op0_c,
+                          immed_wide_int_const(m_wi, compute_mode),
NULL_RTX, 1,
+                          OPTAB_DIRECT);
+      }
+      if (t1) {
+        result = expand_shift(RSHIFT_EXPR, compute_mode, t1, size + post_shift,
+                              NULL_RTX, 1);
+      }
+      if (result) {
+        convert_move(target, result, 1);
+      }

Maybe it would be better if written as:
if (!op0_c)
  continue;
...
if (!t1)
  continue;
As it is written, it is hard to follow.

As far as the costing goes, I had a generic infrastructure posted but
it was thought I was over-thinking it as it was supporting more than 2
costs.
Also It would be useful to dump out the costing of each case too.
Because most folks want quick debugging via the dumps and a lot of the
time it is easier to see the behavior there rather than stepping using
a debugger.
I have been adding these prints as I see them too.

>
> *Background:*
>
> The current unsigned integer division algorithm in GCC (see expmed.cc)
> replaces division with a multiplication and shift sequence.
>
> 1. QUOTIENT = (OP0 * M) >> (SIZE + POST_SHIFT)
>
> where the multiplication constant (M) can be up to SIZE + 1 bits, where
> SIZE is the width of the numerator operand (OP0). The multiplier is
> expressed as the sum of the low (ML) and high (MH) parts, where MH is 1 or
> 0.
>
> M = (MH << SIZE) + ML
>
> If MH == 1, the current implementation assumes that the machine mode is not
> sufficiently wide to accommodate the multiplication and distributes the
> multiplication.
>
> 2. QUOTIENT = (OP0 + T1) >> POST_SHIFT
>
> where T1 = (ML * OP0) >> SIZE. The current implementation further assumes
> that the intermediate sum in 2 will overflow, and uses the midpoint trick:
> a + b >> 1 == a + (b - a) >> 1 where b >= a .
>
> 3. QUOTIENT = (T1 + ((OP0 - T1) >> 1)) >> (POST_SHIFT - 1)
>
> Finally, if the divisor is even, the numerator can be “pre-shifted” to
> reduce the size of the magic number, which results in a pre-shifted version
> of path 1. The current implementation implicitly assumes (and I confirmed
> with benchmarking) that the pre-shift is not worthwhile unless MH == 1.
>
> 4. QUOTIENT = ((OP0 >> PRE_SHIFT) * ML) >> (SIZE + POST_SHIFT)
>
> *Proposal:*
>
> This proposal addresses some missed opportunities to use more efficient
> paths for the MH == 1 case. If the machine mode is sufficiently wide to
> compute the quotient using paths 1 or 2, the resulting code is faster and
> more compact. The pre-shift logic is left as-is, as benchmarks demonstrate
> that the pre-shift is counter productive and results in slower code if MH
> == 0, even if the RTL cost model predicts otherwise. Please see the
> attached patch and github link for implementation details.
>
> https://github.com/gcc-mirror/gcc/compare/master...chuckyschluz:gcc:wide_udiv
>
> *Benchmarks:*
>
> Tested using D = 7 (MH = 1, no pre-shift), and D = 14 (MH = 1, pre-shift)
> with google benchmark on x86_64 (i7-13700KF) at -O3.
>
>
> int16_t
>
> 7: 1.97x speedup (avoid path 3 and use path 1)
> 14: 1.47x speedup (avoid path 4 and use path 1)
>
> int32_t
>
> 7: 1.33x speedup (avoid path 3 and use path 2)
> 14: no change (path 4 is optimal)

Thanks,
Andrea

>
>
> Charlie Schlosser

Re: RFC: Optimize unsigned division by constant when wider machine modes are available

Reply via email to