https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC now does
.L2:
	vmovaps	b(%rax), %xmm6
	vmulps	a(%rax), %xmm6, %xmm0
	addq	$80, %rax
	vmovaps	b-64(%rax), %xmm7
	vmovaps	b-48(%rax), %xmm6
	vaddps	%xmm0, %xmm5, %xmm5
	vmulps	a-64(%rax), %xmm7, %xmm0
	vmovaps	b-32(%rax), %xmm7
	vaddps	%xmm0, %xmm1, %xmm1
	vmulps	a-48(%rax), %xmm6, %xmm0
	vmovaps	b-16(%rax), %xmm6
	vaddps	%xmm0, %xmm4, %xmm4
	vmulps	a-32(%rax), %xmm7, %xmm0
	vaddps	%xmm0, %xmm2, %xmm2
	vmulps	a-16(%rax), %xmm6, %xmm0
	vaddps	%xmm0, %xmm3, %xmm3
	cmpq	$128000, %rax
	jne	.L2
so it uses a VF of 4 with -Ofast. Your LLVM snippet uses 5 lanes
(four lanes in a V4SF and one lane in a scalar) instead of our 20
(five V4SF accumulators). That's interesting but not something we support.
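To make the lane counting concrete, here is a rough scalar C model of the
loop above. It is only a sketch: the source loop is not shown in this
comment, so dot_unrolled5 assumes a dot product manually unrolled by 5, and
both function names are made up for illustration. The model keeps five
4-lane accumulators (standing in for %xmm1..%xmm5) and consumes 20 floats
per iteration, matching the addq $80.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of the source loop (not shown in this comment):
   a dot product manually unrolled by 5. */
float dot_unrolled5(const float *a, const float *b, size_t n)
{
    assert(n % 5 == 0);
    float dot = 0.0f;
    for (size_t i = 0; i < n; i += 5)
        dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
                  + a[i + 4] * b[i + 4];
    return dot;
}

/* Scalar model of the emitted loop: VF 4 means four source iterations
   of 5 elements each, i.e. 20 floats (80 bytes) per vector iteration,
   held in five independent 4-lane accumulators. */
float dot_vf4_model(const float *a, const float *b, size_t n)
{
    assert(n % 20 == 0);              /* the model has no epilogue */
    float acc[5][4] = {{0.0f}};
    for (size_t i = 0; i < n; i += 20)
        for (size_t v = 0; v < 5; v++)
            for (size_t lane = 0; lane < 4; lane++)
                acc[v][lane] += a[i + 4 * v + lane] * b[i + 4 * v + lane];

    /* Final reduction of all 20 partial sums. */
    float sum = 0.0f;
    for (size_t v = 0; v < 5; v++)
        for (size_t lane = 0; lane < 4; lane++)
            sum += acc[v][lane];
    return sum;
}
```

The two functions add the products in a different order, so they only agree
bit-for-bit when reassociation is harmless (which is what -Ofast permits the
compiler to assume).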
Re-rolling would mean using a single four-lane v4sf vector here. For
a pure SLP loop something like this should be possible without too
much hassle, I think. We'd just need to try ... (and think about whether
it's worth it in real life)
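As a rough illustration of what re-rolling would amount to, here is a scalar
C model with a single 4-lane accumulator. This is a sketch of the idea only,
not something GCC emits today; the function name is hypothetical and it
assumes the underlying computation is a plain dot product whose length is a
multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a re-rolled loop: the manual unroll-by-5 is undone,
   and the resulting unit-stride dot product is processed with one
   4-lane accumulator, 4 floats per iteration. */
float dot_rerolled_model(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);               /* the model has no epilogue */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            acc[lane] += a[i + lane] * b[i + lane];

    /* Horizontal reduction of the single accumulator. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Compared with the 20-lane form, this needs one accumulator instead of five,
at the cost of a single dependency chain through the additions.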
For the purpose of the Summary this is fixed now.