https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57952
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |10.1.0
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can now see this vectorized with at least GCC 10 and up.  Note we're
vectorizing the _outer_ loop here; we also manage to vectorize the inner
loop, but only if I comment out the outer one, and it just looks less
efficient.

.L2:
        vmovdqa %ymm6, %ymm2
        movl    $10000000, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        vmovdqa %ymm2, %ymm0
        vpaddd  %ymm6, %ymm2, %ymm2
        vcvtdq2ps       %ymm0, %ymm0
        vaddps  %ymm5, %ymm0, %ymm0
        vmulps  %ymm11, %ymm0, %ymm0
        vmovaps %ymm0, %ymm1
        vfmadd132ps     %ymm10, %ymm9, %ymm1
        vfmadd132ps     %ymm0, %ymm8, %ymm1
        vfmadd132ps     %ymm0, %ymm7, %ymm1
        vfmadd132ps     %ymm0, %ymm5, %ymm1
        vfmadd132ps     %ymm0, %ymm4, %ymm1
        vfmadd132ps     %ymm1, %ymm4, %ymm0
        vaddps  %ymm0, %ymm3, %ymm3
        decl    %eax
        jne     .L3
        incl    %edx
        cmpl    $12, %edx
        jne     .L2
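
For readers without the testcase at hand, the listing above has the shape of a nested float reduction whose inner body converts an integer counter to float, scales it, and evaluates a polynomial by Horner's rule (the vfmadd132ps chain). The C below is only a rough sketch of such a loop nest, not the testcase attached to this PR: the names nested_sum, poly and C, the coefficient values and the 0.5f offset are all made up for illustration, since the real constants sit in registers and cannot be read off the listing.

```c
#include <assert.h>

/* Hypothetical coefficients, chosen only for illustration; the actual
   constants are held in %ymm4..%ymm11 and are not visible in the listing. */
static const float C[6] = {1.0f, 0.5f, 0.25f, 0.125f, 0.0625f, 0.03125f};

/* Horner evaluation of a degree-5 polynomial. Each step is one
   multiply-add, which the compiler may contract into an FMA. */
static float poly(float x)
{
    float p = C[5];
    for (int k = 4; k >= 0; --k)
        p = p * x + C[k];
    return p;
}

/* Nested float sum reduction (hypothetical stand-in for the testcase).
   Either the outer loop over j or the inner loop over i is a candidate
   for vectorization. */
float nested_sum(int outer, int inner)
{
    assert(outer >= 0 && inner > 0);
    float sum = 0.0f;
    for (int j = 0; j < outer; ++j) {
        for (int i = 1; i <= inner; ++i) {
            /* int -> float conversion, add a constant, scale */
            float x = ((float)i + 0.5f) * (1.0f / (float)inner);
            sum += poly(x);
        }
    }
    return sum;
}
```

Assuming a setup along the lines of `gcc -O3 -ffast-math -march=haswell -S -fopt-info-vec`, the vectorizer report should say which loop it picked. A float sum reduction generally needs -ffast-math (or at least -fassociative-math) before GCC will reorder the additions, and the FMA instructions need a target that provides them.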