https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amker at gcc dot gnu.org
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
So performance counters on my Broadwell machine say that there are zero hits
for this vectorized loop with LSD.UOPS while there are very many for the scalar
case. This means the loop body is too large to trigger the LSD (and probably
also to fit the uop cache).
One of the issues is that we require 4 registers for the indexes into the loads:
vmovq (%rax,%r11,2), %xmm7
vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
vmovq (%rax), %xmm7
vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
vmovq (%rax,%r15), %xmm7
vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
vmovq (%rax,%r11,4), %xmm7
vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
vinserti128 $0x1, %xmm4, %ymm9, %ymm9
from
_1507 = (void *) ivtmp.760_1462;
_792 = MEM[base: _1507, offset: 0B];
_1508 = (void *) ivtmp.760_1462;
_794 = MEM[base: _1508, index: _331, offset: 0B];
_1509 = (void *) ivtmp.760_1462;
_796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
_1510 = (void *) ivtmp.760_1462;
_1511 = _331 * 3;
_798 = MEM[base: _1510, index: _1511, offset: 0B];
_1512 = (void *) ivtmp.760_1462;
_800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
_1513 = (void *) ivtmp.760_1462;
_1514 = _331 * 5;
_802 = MEM[base: _1513, index: _1514, offset: 0B];
_1515 = (void *) ivtmp.760_1462;
_1516 = _331 * 6;
_804 = MEM[base: _1515, index: _1516, offset: 0B];
_1517 = (void *) ivtmp.760_1462;
_1518 = _331 * 7;
_806 = MEM[base: _1517, index: _1518, offset: 0B];
vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};
where IVOPTs did a reasonable job. Later LIM hoists all the invariant
_331 * N indexes. IVOPTs also failed to realize that _331 * 3 can be used
for _331 * 6 by using step == 2. But in the end the register-optimal
decision is probably to strength-reduce this (the vectorizer generates
strength-reduced code).
We do end up spilling most IVs in this loop.
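To illustrate the trade-off, here is a rough C sketch of the three addressing
strategies discussed above. It is not GCC output: the function names and the
load64 helper are made up for illustration, and a byte stride s stands in for
_331. The first form keeps the multiples 3, 5, 6 and 7 of the stride live, the
second drops the 6x multiple by scaling the 3x one by 2, and the third is the
strength-reduced variant that only needs one running pointer plus the stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read one 64-bit element at byte offset off. */
static uint64_t load64(const unsigned char *p, size_t off)
{
    uint64_t v;
    memcpy(&v, p + off, sizeof v);
    return v;
}

/* Indexed form, mirroring the GIMPLE: s*3, s*5, s*6 and s*7 are loop
   invariant, so once hoisted each of them wants its own register. */
void gather_indexed(const unsigned char *base, size_t s, uint64_t out[8])
{
    size_t s3 = s * 3, s5 = s * 5, s6 = s * 6, s7 = s * 7;
    out[0] = load64(base, 0);
    out[1] = load64(base, s);
    out[2] = load64(base, s * 2);   /* index s, step 2 */
    out[3] = load64(base, s3);
    out[4] = load64(base, s * 4);   /* index s, step 4 */
    out[5] = load64(base, s5);
    out[6] = load64(base, s6);
    out[7] = load64(base, s7);
}

/* Same loads, but s*6 is addressed as index s3 with step 2, so the
   separate s6 value is no longer needed. */
void gather_reuse_s3(const unsigned char *base, size_t s, uint64_t out[8])
{
    size_t s3 = s * 3, s5 = s * 5, s7 = s * 7;
    out[0] = load64(base, 0);
    out[1] = load64(base, s);
    out[2] = load64(base, s * 2);
    out[3] = load64(base, s3);
    out[4] = load64(base, s * 4);
    out[5] = load64(base, s5);
    out[6] = load64(base, s3 * 2);  /* index s3, step 2 */
    out[7] = load64(base, s7);
}

/* Strength-reduced form: a single pointer is bumped by s per element,
   so no multiples of the stride have to stay live. */
void gather_strength_reduced(const unsigned char *base, size_t s,
                             uint64_t out[8])
{
    const unsigned char *p = base;
    for (int i = 0; i < 8; i++, p += s)
        out[i] = load64(p, 0);
}

/* Self-check: all three forms must read the same eight elements. */
void check_gathers_agree(void)
{
    unsigned char buf[7 * 24 + 8];
    uint64_t a[8], b[8], c[8];
    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (unsigned char)(i * 37 + 11);
    gather_indexed(buf, 24, a);
    gather_reuse_s3(buf, 24, b);
    gather_strength_reduced(buf, 24, c);
    assert(memcmp(a, b, sizeof a) == 0);
    assert(memcmp(a, c, sizeof a) == 0);
}
```

The strength-reduced form trades the hoisted multiples for a dependent chain of
pointer increments, so whether it wins in practice depends on register pressure
in the surrounding loop.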