https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
Callgrind-ing a release-checking stage3 shows get_ref_base_and_extent and
find_hard_regno_for_1 at the top.  And it shows wi::lshift_large called from
get_ref_base_and_extent - exactly what I feared... we do hit both
wi::lshift_large and wi::mul_internal.

perf confirms the hot spots get_ref_base_and_extent (9%) and
find_hard_regno_for_1 (19%), but wi::lshift_large is somewhat down (1.8%)
and wi::mul_internal is at 1%.  Note the shifts are all by 3
(BITS_PER_UNIT multiplication).

The

  /* The first unfilled output block is a left shift of the first block in
     XVAL.  The other output blocks contain bits from two consecutive input
     blocks.  */
  unsigned HOST_WIDE_INT carry = 0;
  for (unsigned int i = skip; i < len; ++i)
    {
      unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
      val[i] = (x << small_shift) | carry;
      carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
    }

loop in lshift_large doesn't seem to be very latency friendly:

  4.02 │ e0:   mov    (%r11,%r9,8),%rax
  5.54 │ e4:   mov    %rax,%rdi
       │       mov    %r8d,%ecx
       │       shl    %cl,%rdi
  4.23 │       mov    %rdi,%rcx
  3.91 │       or     %r15,%rcx
  2.06 │       mov    %rcx,(%r14,%r9,8)
  7.38 │       mov    %r13d,%ecx
  1.41 │       add    $0x1,%r9
       │       shr    %cl,%rax
  3.04 │       cmp    %r12,%r9
  3.37 │       mov    %rax,%r15
  1.95 │     ↑ je     9e

I wonder if GCC can be more efficient here by special-casing skip == 0,
len == 2 and using an __int128 on hosts where that is available (a sketch
follows at the end of this comment).  In this case we're shifting
xlen == 1 values, but the precision might need two blocks (the byte
quantity fits one block, the scaled bit quantity may not), so
special-casing that might also make sense.  It helps a bit, but of course
all the extra testing has overhead as well.

Maybe a wi::bytes_to_bits helper is a better solution here (also sketched
below).

Anyway, somehow caching the get_ref_base_and_extent result (which we
re-compute only for the stores, btw, for stmt_may_clobber_ref_p) might
help more (see the last sketch below).

Note that with release checking the testcase compiles quite fast for me:

 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21         0.22         12.42          152002 kB

(that's without mucking with timevars).
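
Roughly what I have in mind for the __int128 fast path, as a standalone
sketch rather than a patch against wide-int.cc - the names lshift_1_to_2
and hwi are made up for the example; the point is only that the
single-input-block, two-output-block case collapses to one double-word
shift:

  #include <cstdint>
  #include <cstdio>

  typedef uint64_t hwi;   /* stands in for unsigned HOST_WIDE_INT */

  /* Shift the single block X left by SHIFT (0 <= SHIFT < 64) into
     VAL[0] (low) and VAL[1] (high) with one double-word operation
     instead of the per-block carry loop.  */
  static inline void
  lshift_1_to_2 (hwi *val, hwi x, unsigned int shift)
  {
  #ifdef __SIZEOF_INT128__
    unsigned __int128 wide = (unsigned __int128) x << shift;
    val[0] = (hwi) wide;
    val[1] = (hwi) (wide >> 64);
  #else
    /* Portable fallback, careful to avoid an undefined shift by 64.  */
    val[0] = x << shift;
    val[1] = shift ? x >> (64 - shift) : 0;
  #endif
  }

  int
  main ()
  {
    hwi val[2];
    /* The byte-to-bit scaling case: shift by LOG2_BITS_PER_UNIT == 3.  */
    lshift_1_to_2 (val, 0x8000000000000001ULL, 3);
    printf ("%016llx %016llx\n",
            (unsigned long long) val[1], (unsigned long long) val[0]);
    return 0;
  }

This should compile to a couple of straight-line shift instructions, with
no loop-carried dependency through a carry register.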
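
And a sketch of the wi::bytes_to_bits idea - name and signature are
hypothetical, and it ignores the sign extension and precision canonization
the real wide-int code would have to do.  The point is that with the shift
amount fixed at LOG2_BITS_PER_UNIT == 3 the carry computation needs no
variable shift, so both the special-case test and the latency problem go
away:

  #include <cstdint>

  typedef uint64_t hwi;   /* stands in for unsigned HOST_WIDE_INT */

  /* Scale a non-negative byte quantity of XLEN blocks to bits, i.e.
     shift it left by 3.  Returns the output length, which grows by one
     block when the top bits spill over.  */
  static unsigned int
  bytes_to_bits (hwi *val, const hwi *xval, unsigned int xlen)
  {
    hwi carry = 0;
    for (unsigned int i = 0; i < xlen; ++i)
      {
        hwi x = xval[i];
        val[i] = (x << 3) | carry;
        carry = x >> 61;    /* 64 - 3; a constant, unlike in lshift_large */
      }
    unsigned int len = xlen;
    if (carry)
      val[len++] = carry;
    return len;
  }

For the common xlen == 1 case this is two constant shifts and a
conditional store, and callers scaling byte offsets would never go through
lshift_large at all.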
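
For the caching idea, something along these lines - purely a sketch: the
ref_extent struct, the cache variable and the wrapper name are all made
up, and the hard part (invalidating entries when the IL changes) is
hand-waved away:

  /* Hypothetical memoization of get_ref_base_and_extent, keyed on the
     reference tree.  */

  struct ref_extent
  {
    tree base;
    HOST_WIDE_INT offset, size, max_size;
  };

  static hash_map<tree, ref_extent> *ref_extent_cache;

  static tree
  cached_ref_base_and_extent (tree ref, HOST_WIDE_INT *poffset,
                              HOST_WIDE_INT *psize, HOST_WIDE_INT *pmax_size)
  {
    bool existed;
    ref_extent &e = ref_extent_cache->get_or_insert (ref, &existed);
    if (!existed)
      e.base = get_ref_base_and_extent (ref, &e.offset, &e.size,
                                        &e.max_size);
    *poffset = e.offset;
    *psize = e.size;
    *pmax_size = e.max_size;
    return e.base;
  }

stmt_may_clobber_ref_p would then decompose each store ref once instead of
once per query, which should help given alias stmt walking is 37% of
compile time here.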