https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What            |Removed                     |Added
----------------------------------------------------------------------------
           Assignee        |rguenth at gcc dot gnu.org  |unassigned at gcc dot gnu.org
           Known to fail   |                            |11.0
           Last reconfirmed|2019-01-25 00:00:00         |2021-3-11
           Status          |ASSIGNED                    |NEW

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.  Note the same in-order reduction is profitable with SSE:

0x3704d70 *_3 1 times scalar_load costs 12 in body
0x3704d70 _4 + r_16 1 times scalar_stmt costs 12 in body
0x3825d80 _4 + r_16 4 times vec_to_scalar costs 16 in body
0x3825d80 _4 + r_16 4 times scalar_stmt costs 48 in body
0x3825d80 *_3 1 times unaligned_load (misalign -1) costs 12 in body
t4.c:1:53: note: Cost model analysis:
  Vector inside of loop cost: 76
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

bar:
.LFB0:
        .cfi_startproc
        leaq    4096(%rdi), %rax
        pxor    %xmm1, %xmm1
        .p2align 4,,10
        .p2align 3
.L2:
        movups  (%rdi), %xmm0
        addq    $16, %rdi
        addss   %xmm0, %xmm1
        movaps  %xmm0, %xmm2
        shufps  $85, %xmm0, %xmm2
        addss   %xmm2, %xmm1
        movaps  %xmm0, %xmm2
        unpckhps        %xmm0, %xmm2
        shufps  $255, %xmm0, %xmm0
        addss   %xmm2, %xmm1
        addss   %xmm0, %xmm1
        cmpq    %rdi, %rax
        jne     .L2
        cvttss2sil      %xmm1, %eax
        ret

If the code in forwprop that decomposes loads of BLKmode vectors used only by
BIT_FIELD_REFs were relaxed to cover all modes and simple TARGET_MEM_REFs, we'd
get

.L2:
        addl    $1, %eax
        addss   (%rdi), %xmm0
        addss   4(%rdi), %xmm0
        addq    $16, %rdi
        addss   -8(%rdi), %xmm0
        addss   -4(%rdi), %xmm0
        cmpl    $256, %eax
        jne     .L2