https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC now does

.L2:
        vmovaps b(%rax), %xmm6
        vmulps  a(%rax), %xmm6, %xmm0
        addq    $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps  %xmm0, %xmm5, %xmm5
        vmulps  a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps  %xmm0, %xmm1, %xmm1
        vmulps  a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps  %xmm0, %xmm4, %xmm4
        vmulps  a-32(%rax), %xmm7, %xmm0
        vaddps  %xmm0, %xmm2, %xmm2
        vmulps  a-16(%rax), %xmm6, %xmm0
        vaddps  %xmm0, %xmm3, %xmm3
        cmpq    $128000, %rax
        jne     .L2

thus uses a VF of 4 with -Ofast.  Your LLVM snippet uses 5 lanes instead of
our 20, with four lanes in a V4SF and one lane in a scalar.  That's
interesting but not something we support.  Re-rolling would mean using a
single v4sf 4 lane vector here.

For a pure SLP loop something like this should be possible without too much
hassle, I think.  We'd just need to try ... (and think about whether it's
worth it in real life)

For the purpose of the Summary this is fixed now.
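
For readers following along, here is a rough scalar sketch of the lane layouts discussed above. The source loop is not quoted in this comment, so its shape is an assumption inferred from the assembly: a float dot product manually unrolled by five (five multiply/accumulate pairs, an 80-byte stride, and a 128000-byte bound, i.e. 32000 floats). All function names (fill_inputs, dot_unrolled5, dot_vf4, dot_rerolled) are hypothetical, and dot_rerolled is only one reading of what "re-rolling" to a single v4sf would look like.

```c
#include <assert.h>

#define LEN 32000   /* assumed: 128000-byte loop bound / sizeof(float) */

static float a[LEN], b[LEN];

/* Fill with small integer values so every float sum below is exact
   and the summation order cannot change the result. */
void fill_inputs(void)
{
    for (int i = 0; i < LEN; i++) {
        a[i] = (float)(i % 7);
        b[i] = (float)(i % 3);
    }
}

/* Assumed source shape: dot product manually unrolled by five. */
float dot_unrolled5(void)
{
    float dot = 0.0f;
    for (int i = 0; i < LEN; i += 5)
        dot += a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2]
             + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
    return dot;
}

/* Scalar emulation of the layout in the listing: VF 4 over the
   five-wide group gives 20 lanes, held in five 4-lane accumulators. */
float dot_vf4(void)
{
    float acc[5][4] = {{0.0f}};
    float dot = 0.0f;

    assert(LEN % 20 == 0);              /* no epilogue needed */
    for (int i = 0; i < LEN; i += 20)   /* 20 floats = 80 bytes per trip */
        for (int v = 0; v < 5; v++)     /* one accumulator per vmulps/vaddps pair */
            for (int l = 0; l < 4; l++)
                acc[v][l] += a[i + 4 * v + l] * b[i + 4 * v + l];

    for (int v = 0; v < 5; v++)         /* final horizontal reduction */
        for (int l = 0; l < 4; l++)
            dot += acc[v][l];
    return dot;
}

/* One interpretation of re-rolling: a single 4-lane accumulator
   stepping four floats at a time. */
float dot_rerolled(void)
{
    float acc[4] = {0.0f};

    assert(LEN % 4 == 0);
    for (int i = 0; i < LEN; i += 4)
        for (int l = 0; l < 4; l++)
            acc[l] += a[i + l] * b[i + l];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The three functions add the same products in different orders. With arbitrary float inputs they may differ in the last bits, which is why this kind of reassociation needs -Ofast (or -ffast-math); the integer-valued inputs from fill_inputs() keep all sums exact so the results can be compared directly.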