https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57952

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |10.1.0
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can now see this vectorized with GCC 10 and up.  Note we're
vectorizing the _outer_ loop here.  We also manage to vectorize the
inner loop, but only if I comment out the outer one, and the result
just looks less efficient.

.L2:
        vmovdqa %ymm6, %ymm2
        movl    $10000000, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        vmovdqa %ymm2, %ymm0
        vpaddd  %ymm6, %ymm2, %ymm2
        vcvtdq2ps       %ymm0, %ymm0
        vaddps  %ymm5, %ymm0, %ymm0
        vmulps  %ymm11, %ymm0, %ymm0
        vmovaps %ymm0, %ymm1
        vfmadd132ps     %ymm10, %ymm9, %ymm1
        vfmadd132ps     %ymm0, %ymm8, %ymm1
        vfmadd132ps     %ymm0, %ymm7, %ymm1
        vfmadd132ps     %ymm0, %ymm5, %ymm1
        vfmadd132ps     %ymm0, %ymm4, %ymm1
        vfmadd132ps     %ymm1, %ymm4, %ymm0
        vaddps  %ymm0, %ymm3, %ymm3
        decl    %eax
        jne     .L3
        incl    %edx
        cmpl    $12, %edx
        jne     .L2
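
For readers unfamiliar with the term, here is a rough sketch of what
outer-loop vectorization means.  This is NOT the testcase from this PR;
the function names, the loop body and the LANES constant (8 floats per
ymm register) are made up for illustration.  The second function spells
out by hand the shape the vectorizer produces: each vector lane handles
a different outer iteration while the inner loop stays a scalar-style
loop over j.

```c
#define LANES 8  /* assumption: one 256-bit ymm register holds 8 floats */

/* Hypothetical scalar loop nest: one accumulator per outer iteration. */
void nest_scalar(float *out, int outer, int inner)
{
    for (int i = 0; i < outer; i++) {
        float acc = 0.0f;
        for (int j = 0; j < inner; j++)
            acc += (float)(i + j) * 2.0f;
        out[i] = acc;
    }
}

/* Hand-written equivalent of outer-loop vectorization: process LANES
   outer iterations at once, keeping the inner loop over j intact. */
void nest_outer_blocked(float *out, int outer, int inner)
{
    int i = 0;
    for (; i + LANES <= outer; i += LANES) {
        float acc[LANES] = {0};
        for (int j = 0; j < inner; j++)
            for (int l = 0; l < LANES; l++)   /* one lane per outer iteration */
                acc[l] += (float)(i + l + j) * 2.0f;
        for (int l = 0; l < LANES; l++)
            out[i + l] = acc[l];
    }
    for (; i < outer; i++) {                  /* scalar tail for leftover iterations */
        float acc = 0.0f;
        for (int j = 0; j < inner; j++)
            acc += (float)(i + j) * 2.0f;
        out[i] = acc;
    }
}
```

Vectorizing the inner loop instead would spread consecutive j values
across the lanes and need a horizontal reduction of the accumulator
after the loop.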
