https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
Running callgrind on a release-checking stage3 shows get_ref_base_and_extent
and find_hard_regno_for_1 at the top.
It also shows wi::lshift_large called from get_ref_base_and_extent - exactly
what I feared: we do hit both wi::lshift_large and wi::mul_internal.
perf confirms the hot spots get_ref_base_and_extent (9%) and
find_hard_regno_for_1 (19%), but wi::lshift_large is somewhat further down
(1.8%) and wi::mul_internal is at 1%. Note the shifts are all by 3
(multiplication by BITS_PER_UNIT, converting byte offsets to bit offsets).
The following loop in lshift_large doesn't seem to be very latency friendly
(the carry makes each iteration depend on the result of the previous one):

      /* The first unfilled output block is a left shift of the first
         block in XVAL.  The other output blocks contain bits from two
         consecutive input blocks.  */
      unsigned HOST_WIDE_INT carry = 0;
      for (unsigned int i = skip; i < len; ++i)
        {
          unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
          val[i] = (x << small_shift) | carry;
          carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
        }
  4.02 │ e0:   mov    (%r11,%r9,8),%rax
  5.54 │ e4:   mov    %rax,%rdi
       │       mov    %r8d,%ecx
       │       shl    %cl,%rdi
  4.23 │       mov    %rdi,%rcx
  3.91 │       or     %r15,%rcx
  2.06 │       mov    %rcx,(%r14,%r9,8)
  7.38 │       mov    %r13d,%ecx
  1.41 │       add    $0x1,%r9
       │       shr    %cl,%rax
  3.04 │       cmp    %r12,%r9
  3.37 │       mov    %rax,%r15
  1.95 │     ↑ je     9e
I wonder if GCC can be more efficient here by special-casing skip == 0,
len == 2 and using an __int128 on hosts where that is available.
In this case we're shifting xlen == 1 values, but the precision might need
2 blocks (going from byte to bit precision can overflow one block).
Special-casing that might also make sense.
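Something like the following standalone sketch shows the idea for the
xlen == 1 case (not a patch against the wide-int code; lshift_1to2 and the
fixed two-block output are made up for the example):

  #include <stdint.h>
  #include <stdio.h>

  /* Shift a single sign-extended 64-bit input block left by SHIFT (< 64),
     producing the two output blocks the generic carry loop would produce,
     but with one 128-bit shift and no loop-carried dependency.  */
  static void
  lshift_1to2 (int64_t xval0, unsigned int shift, uint64_t val[2])
  {
    /* Sign-extend the input block to 128 bits, then do a single shift;
       on x86-64 this lowers to a shl/shld pair.  */
    unsigned __int128 x = (unsigned __int128) (__int128) xval0 << shift;
    val[0] = (uint64_t) x;
    val[1] = (uint64_t) (x >> 64);
  }

  int
  main (void)
  {
    uint64_t val[2];
    lshift_1to2 (0x12, 3, val);   /* the byte-to-bit case from the profile */
    printf ("%#llx:%#llx\n", (unsigned long long) val[1],
            (unsigned long long) val[0]);   /* 0:0x90 */
    lshift_1to2 (-1, 3, val);     /* negative offsets stay sign-extended */
    printf ("%#llx:%#llx\n", (unsigned long long) val[1],
            (unsigned long long) val[0]);   /* all-ones : 0xfff...ff8 */
    return 0;
  }

The single 128-bit shift replaces the whole carry loop, so nothing is
carried from one iteration to the next.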
Special-casing along those lines helps a bit, but of course all the extra
checks have overhead as well.
Maybe a wi::bytes_to_bits helper is a better solution here.
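wi::bytes_to_bits doesn't exist yet, so just to sketch what such a helper
could do for the common single-block case (the name, the signature and the
two-block output representation are all made up here):

  #include <stdint.h>
  #include <stdio.h>

  #define LOG2_BITS_PER_UNIT 3

  /* Hypothetical bytes-to-bits conversion of a sign-extended single-block
     byte offset.  Returns the number of output blocks written.  In the
     common case the bit offset still fits one block, so neither the carry
     loop nor a second block is needed.  Assumes arithmetic right shift of
     negative values, as GCC-compiled hosts provide.  */
  static unsigned int
  bytes_to_bits (int64_t bytes, int64_t val[2])
  {
    /* Shift in unsigned arithmetic to avoid signed-overflow UB, then
       check whether the result round-trips; if so no bits were lost.  */
    int64_t lo = (int64_t) ((uint64_t) bytes << LOG2_BITS_PER_UNIT);
    val[0] = lo;
    if (lo >> LOG2_BITS_PER_UNIT == bytes)
      return 1;
    /* Rare case: the top three bits spill into a second block, which is
       the arithmetic high part of the equivalent 128-bit shift.  */
    val[1] = bytes >> (64 - LOG2_BITS_PER_UNIT);
    return 2;
  }

  int
  main (void)
  {
    int64_t val[2];
    printf ("%u\n", bytes_to_bits (24, val));                /* 1, val[0] == 192 */
    printf ("%u\n", bytes_to_bits (INT64_C (1) << 61, val)); /* 2 */
    return 0;
  }

The point would be that the helper proves the single-block fast path up
front instead of lshift_large re-discovering it on every call.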
Anyway, caching the get_ref_base_and_extent result somehow (which, btw, we
re-compute only for the stores, for stmt_may_clobber_ref_p) might help more.
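For illustration only, the shape such a cache could take - this uses
std::unordered_map and stub types instead of GCC's own hash tables and
trees, and a real version would need invalidating whenever the IL changes:

  #include <unordered_map>

  /* Stand-ins for the real tree type and for the result of
     get_ref_base_and_extent.  */
  typedef const void *tree_ref;

  struct ref_extent
  {
    tree_ref base;        /* base object of the reference */
    long long offset;     /* bit offset from the base */
    long long size;       /* size of the access in bits */
    long long max_size;   /* conservative upper bound on size */
  };

  /* Stub standing in for the expensive walk that
     get_ref_base_and_extent performs.  */
  static ref_extent
  compute_ref_extent (tree_ref ref)
  {
    ref_extent e = { ref, 0, 0, 0 };
    /* ... the real walk of the handled-component chain ... */
    return e;
  }

  /* During the alias walk stmt_may_clobber_ref_p re-computes the extent
     of each store once per query, so memoizing on the reference node pays
     off when the same stores are visited many times.  */
  static std::unordered_map<tree_ref, ref_extent> extent_cache;

  static const ref_extent &
  cached_ref_extent (tree_ref ref)
  {
    auto it = extent_cache.find (ref);
    if (it == extent_cache.end ())
      it = extent_cache.emplace (ref, compute_ref_extent (ref)).first;
    return it->second;
  }

  int
  main (void)
  {
    int dummy;
    const ref_extent &e1 = cached_ref_extent (&dummy);  /* computes */
    const ref_extent &e2 = cached_ref_extent (&dummy);  /* cache hit */
    return e1.base == e2.base ? 0 : 1;
  }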
Note that with release-checking the testcase compiles quite fast for me.
 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21             0.22            12.42            152002 kB
(that's w/o mucking with timevars).