https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
Richard Biener <rguenth at gcc dot gnu.org> changed:
What            |Removed                    |Added
----------------------------------------------------------------------------
Assignee        |rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
Known to fail   |                           |11.0
Last reconfirmed|2019-01-25 00:00:00        |2021-03-11
Status          |ASSIGNED                   |NEW
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed. Note the cost model considers the same in-order reduction
profitable with SSE (vector body cost 76 vs. scalar cost 24 per iteration
at VF 4, i.e. 96):
0x3704d70 *_3 1 times scalar_load costs 12 in body
0x3704d70 _4 + r_16 1 times scalar_stmt costs 12 in body
0x3825d80 _4 + r_16 4 times vec_to_scalar costs 16 in body
0x3825d80 _4 + r_16 4 times scalar_stmt costs 48 in body
0x3825d80 *_3 1 times unaligned_load (misalign -1) costs 12 in body
t4.c:1:53: note: Cost model analysis:
Vector inside of loop cost: 76
Vector prologue cost: 0
Vector epilogue cost: 0
Scalar iteration cost: 24
Scalar outside cost: 0
Vector outside cost: 0
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 0
bar:
.LFB0:
.cfi_startproc
leaq 4096(%rdi), %rax
pxor %xmm1, %xmm1
.p2align 4,,10
.p2align 3
.L2:
movups (%rdi), %xmm0
addq $16, %rdi
addss %xmm0, %xmm1
movaps %xmm0, %xmm2
shufps $85, %xmm0, %xmm2
addss %xmm2, %xmm1
movaps %xmm0, %xmm2
unpckhps %xmm0, %xmm2
shufps $255, %xmm0, %xmm0
addss %xmm2, %xmm1
addss %xmm0, %xmm1
cmpq %rdi, %rax
jne .L2
cvttss2sil %xmm1, %eax
ret
If the code in forwprop that decomposes loads of BLKmode vectors used only
by BIT_FIELD_REFs were relaxed to cover all modes, as well as
TARGET_MEM_REFs that are simple, we'd get
.L2:
addl $1, %eax
addss (%rdi), %xmm0
addss 4(%rdi), %xmm0
addq $16, %rdi
addss -8(%rdi), %xmm0
addss -4(%rdi), %xmm0
cmpl $256, %eax
jne .L2