https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not 100% sure that is always better.

What is happening is that GCC is vectorizing the outer loop as well.

It is easier to see in the aarch64 asm:
.L4:
        ldr     q27, [x3], 16
        ld4     {v28.2d - v31.2d}, [x4]
        fmul    v24.2d, v27.2d, v28.2d
        fmul    v25.2d, v27.2d, v29.2d
        fmul    v26.2d, v27.2d, v30.2d
        fmul    v27.2d, v27.2d, v31.2d
        st4     {v24.2d - v27.2d}, [x4], 64
        cmp     x3, x5
        bne     .L4
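
Read back into C, that loop seems to be doing roughly the following. This is only a sketch inferred from the asm above: the function names, the flat array layout with 4 doubles per row, and the even trip count are assumptions, not taken from the testcase. The ld4/st4 pair de-interleaves two consecutive rows, so each fmul handles one column for two outer-loop iterations at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar form: scale each 4-element row of a by b[i]. */
void scale_rows_scalar(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < 4; j++)
            a[i * 4 + j] *= b[i];
}

/* Same computation arranged the way the asm processes it: two outer
   iterations (rows i and i+1) per trip, one multiply pair per column.
   Each pair of statements models the two lanes of a .2d vector. */
void scale_rows_pairwise(double *a, const double *b, size_t n)
{
    assert(n % 2 == 0);                    /* sketch ignores the epilogue */
    for (size_t i = 0; i < n; i += 2) {
        double b0 = b[i], b1 = b[i + 1];   /* ldr q27, [x3], 16 */
        for (size_t j = 0; j < 4; j++) {   /* the four fmul instructions */
            a[i * 4 + j]       *= b0;      /* lane 0: row i,   column j */
            a[(i + 1) * 4 + j] *= b1;      /* lane 1: row i+1, column j */
        }
    }
}
```

Both functions compute the same result; the second only makes explicit that the vector lanes run across the outer index rather than the inner one.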

Have you benchmarked both?

If anything this is a cost model issue.
