On Tue, Apr 28, 2026 at 6:28 PM Charles Schlosser via Gcc <[email protected]> wrote: > > Greetings! I am requesting feedback on whether the gcc community is > interested in this contribution and any technical input on the proposed > approach. Thank you!
First I want to say thank you for sending this off. I noticed you sent this to gcc@ rather than gcc-patches@. Patches normally goto gcc-patches@ for review. Also please read https://gcc.gnu.org/contribute.html; especially the legal part. That being said, the code formatting is totally different from GCC's normal formatting; it is hard to review. Also it would be useful if this patch was split into two. The first one extracts the original code into a new function and then adds the new code on top of it. You also removed a bunch of comments that were there when you extracted into a new function. Also with: if (op0_c) { + t1 = expand_binop(compute_mode, smul_optab, op0_c, + immed_wide_int_const(m_wi, compute_mode), NULL_RTX, 1, + OPTAB_DIRECT); + } + if (t1) { + result = expand_shift(RSHIFT_EXPR, compute_mode, t1, size + post_shift, + NULL_RTX, 1); + } + if (result) { + convert_move(target, result, 1); + } Maybe it would be better if written as: if (!op0_c) continue; ... if (!t1) continue; As it is written, it is hard to follow. As far as the costing goes, I had a generic infrastructure posted but it was thought I was over-thinking it as it was supporting more than 2 costs. Also It would be useful to dump out the costing of each case too. Because most folks want quick debugging via the dumps and a lot of the time it is easier to see the behavior there rather than stepping using a debugger. I have been adding these prints as I see them too. > > *Background:* > > The current unsigned integer division algorithm in GCC (see expmed.cc) > replaces division with a multiplication and shift sequence. > > 1. QUOTIENT = (OP0 * M) >> (SIZE + POST_SHIFT) > > where the multiplication constant (M) can be up to SIZE + 1 bits, where > SIZE is the width of the numerator operand (OP0). The multiplier is > expressed as the sum of the low (ML) and high (MH) parts, where MH is 1 or > 0. > > M = (MH << SIZE) + ML > > If MH == 1, the current implementation assumes that the machine mode is not > sufficiently wide to accommodate the multiplication and distributes the > multiplication. > > 2. QUOTIENT = (OP0 + T1) >> POST_SHIFT > > where T1 = (ML * OP0) >> SIZE. The current implementation further assumes > that the intermediate sum in 2 will overflow, and uses the midpoint trick: > a + b >> 1 == a + (b - a) >> 1 where b >= a . > > 3. QUOTIENT = (T1 + ((OP0 - T1) >> 1)) >> (POST_SHIFT - 1) > > Finally, if the divisor is even, the numerator can be “pre-shifted” to > reduce the size of the magic number, which results in a pre-shifted version > of path 1. The current implementation implicitly assumes (and I confirmed > with benchmarking) that the pre-shift is not worthwhile unless MH == 1. > > 4. QUOTIENT = ((OP0 >> PRE_SHIFT) * ML) >> (SIZE + POST_SHIFT) > > *Proposal:* > > This proposal addresses some missed opportunities to use more efficient > paths for the MH == 1 case. If the machine mode is sufficiently wide to > compute the quotient using paths 1 or 2, the resulting code is faster and > more compact. The pre-shift logic is left as-is, as benchmarks demonstrate > that the pre-shift is counter productive and results in slower code if MH > == 0, even if the RTL cost model predicts otherwise. Please see the > attached patch and github link for implementation details. > > https://github.com/gcc-mirror/gcc/compare/master...chuckyschluz:gcc:wide_udiv > > *Benchmarks:* > > Tested using D = 7 (MH = 1, no pre-shift), and D = 14 (MH = 1, pre-shift) > with google benchmark on x86_64 (i7-13700KF) at -O3. > > > int16_t > > 7: 1.97x speedup (avoid path 3 and use path 1) > 14: 1.47x speedup (avoid path 4 and use path 1) > > int32_t > > 7: 1.33x speedup (avoid path 3 and use path 2) > 14: no change (path 4 is optimal) Thanks, Andrea > > > Charlie Schlosser
