https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
CC| |rguenth at gcc dot gnu.org
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
IIRC we have a duplicate for this. The issue is that the SLP vectorizer
doesn't handle reductions (not implemented), and thus the vector results
need to be decomposed for the scalar reduction tail. On x86 with -mavx2 we get:
vmovdqu (%rdi), %xmm0
vpshufb .LC0(%rip), %xmm0, %xmm0
vpmovzxbw %xmm0, %xmm1
vpsrldq $8, %xmm0, %xmm0
vpmovzxwd %xmm1, %xmm2
vpsrldq $8, %xmm1, %xmm1
vpmovzxbw %xmm0, %xmm0
vpmovzxwd %xmm1, %xmm1
vmovaps %xmm2, -72(%rsp)
movl -68(%rsp), %eax
vmovaps %xmm1, -56(%rsp)
vpmovzxwd %xmm0, %xmm1
vpsrldq $8, %xmm0, %xmm0
addl -52(%rsp), %eax
vpmovzxwd %xmm0, %xmm0
vmovaps %xmm1, -40(%rsp)
movl -56(%rsp), %edx
addl -36(%rsp), %eax
vmovaps %xmm0, -24(%rsp)
addl -72(%rsp), %edx
addl -20(%rsp), %eax
addl -40(%rsp), %edx
addl -24(%rsp), %edx
addl %edx, %eax
movl -48(%rsp), %edx
addl -64(%rsp), %edx
addl -32(%rsp), %edx
addl -16(%rsp), %edx
addl %edx, %eax
movl -44(%rsp), %edx
addl -60(%rsp), %edx
addl -28(%rsp), %edx
addl -12(%rsp), %edx
addl %edx, %eax
ret
The main issue, of course, is that we fail to elide the stack temporary.
Re-running FRE after loop opts might help here but of course
SLP vectorization handling the reduction would be best (though the
tail loop is structured badly, not matching up with the head one).
Whether vectorizing this specific testcase's head loop is profitable
or not is questionable on its own, of course (but you can easily make
it so and still get similarly ugly code in the tail).
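
For readers without the attachment at hand, the shape being discussed can be
sketched roughly as below. This is a hypothetical illustration, not the
testcase from this PR: the function name sum_block and the 4x4 layout are
assumptions chosen only to show a widening head loop that fills a stack
temporary, followed by a scalar reduction tail that walks that temporary in
a different order. What a given GCC version generates for it may well differ
from the listing above.

```c
/* Hypothetical sketch, NOT the testcase attached to this PR.
   sum_block and the 4x4 layout are assumptions for illustration. */
unsigned int sum_block(const unsigned char *b)
{
    unsigned int tmp[4][4];   /* stack temporary we would like to see elided */
    unsigned int sum = 0;

    /* Head loop: widen 16 bytes to 32-bit ints, row by row.  Once
       unrolled this is straight-line code, a candidate for SLP.  */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            tmp[i][j] = b[4 * i + j];

    /* Tail loop: scalar reduction that walks tmp column by column,
       so its structure does not match the row-wise head loop.  */
    for (int j = 0; j < 4; j++)
        sum += tmp[0][j] + tmp[1][j] + tmp[2][j] + tmp[3][j];

    return sum;
}
```

Compiling something like this with -O3 -mavx2 -S is one way to check whether
the vector stores to the stack followed by scalar reloads still show up.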