https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not 100% sure that is always better.
What is happening is that GCC is vectorizing even the outer loop.
It is also easier to understand from the aarch64 asm:
.L4:
        ldr     q27, [x3], 16
        ld4     {v28.2d - v31.2d}, [x4]
        fmul    v24.2d, v27.2d, v28.2d
        fmul    v25.2d, v27.2d, v29.2d
        fmul    v26.2d, v27.2d, v30.2d
        fmul    v27.2d, v27.2d, v31.2d
        st4     {v24.2d - v27.2d}, [x4], 64
        cmp     x3, x5
        bne     .L4
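
For anyone less used to ld4/st4, here is a rough C sketch of what one trip
through .L4 computes. It assumes the source loop has the shape
a[4*i+j] *= b[i] with a 4-iteration inner loop; that shape is inferred from
the asm above and is not the reporter's testcase. The function names, the
signatures, and the scalar tail loop are hypothetical and only there for
illustration.

```c
#include <stddef.h>

/* Assumed source shape (inferred from the asm, not taken from the report):
   every group of 4 doubles in a[] is scaled by one element of b[]. */
void scale_rows(double *restrict a, const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* outer loop */
        for (size_t j = 0; j < 4; j++)      /* inner loop, 4 iterations */
            a[4 * i + j] *= b[i];
}

/* Scalar model of the outer-loop-vectorized code: each trip handles two
   outer iterations (i and i+1), which is what one pass through .L4 does. */
void scale_rows_outer_vec(double *restrict a, const double *restrict b,
                          size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        double b0 = b[i], b1 = b[i + 1];    /* ldr q27: {b[i], b[i+1]} */
        double *p = a + 4 * i;              /* x4 */
        for (size_t j = 0; j < 4; j++) {
            /* ld4 de-interleaves with stride 4, so register j holds
               {p[j], p[4 + j]}; each fmul multiplies that by {b0, b1},
               and st4 interleaves the results back. */
            p[j]     *= b0;
            p[4 + j] *= b1;
        }
    }
    /* Leftover iteration when n is odd; the epilogue is not shown in the
       asm above, so this tail is an assumption. */
    for (; i < n; i++)
        for (size_t j = 0; j < 4; j++)
            a[4 * i + j] *= b[i];
}
```

Both functions give the same result; the second only makes explicit that the
vector lanes run across the outer index i while the four fmul instructions
cover the inner index j.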
Have you benchmarked both?
If anything this is a cost model issue.