https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not 100% sure that is always better.
What is happening is GCC is vectorizing even the outer loop.
It is easier to understand via aarch64 asm too:
.L4:
        ldr     q27, [x3], 16
        ld4     {v28.2d - v31.2d}, [x4]
        fmul    v24.2d, v27.2d, v28.2d
        fmul    v25.2d, v27.2d, v29.2d
        fmul    v26.2d, v27.2d, v30.2d
        fmul    v27.2d, v27.2d, v31.2d
        st4     {v24.2d - v27.2d}, [x4], 64
        cmp     x3, x5
        bne     .L4

Have you benchmarked both? If anything this is a cost model issue.
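
For readers following the asm, here is a rough C sketch of what the .L4 loop appears to compute. It is reconstructed from the instructions alone and is not the reporter's testcase. The function names, the parameter names and the even-n restriction are all assumptions made for illustration. Each iteration loads two doubles from one array (ldr q27), loads two 4-element rows of the other array de-interleaved by column (ld4), multiplies each column pair by the two scale factors (four fmul), and stores the rows back interleaved (st4).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar form of the loop nest: scale each row of 4 doubles
 * in a[] by the matching element of b[]. Names are illustrative only. */
void scale_rows(double *restrict a, const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* outer loop over rows */
        for (size_t j = 0; j < 4; j++)      /* inner loop over 4 columns */
            a[4 * i + j] *= b[i];
}

/* Scalar model of the outer-loop-vectorized .L4 body: two rows per
 * iteration, one "lane" per row. Assumes n is even; a real compiler
 * would also emit an epilogue for the remainder. */
void scale_rows_two_at_a_time(double *restrict a, const double *restrict b,
                              size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double b0 = b[i];                   /* ldr q27: lane 0 */
        double b1 = b[i + 1];               /* ldr q27: lane 1 */
        for (size_t j = 0; j < 4; j++) {    /* one fmul per column j */
            a[4 * i + j]       *= b0;       /* lane 0: row i     */
            a[4 * (i + 1) + j] *= b1;       /* lane 1: row i + 1 */
        }
    }
}
```

Both functions should produce identical results for even n. Whether the ld4/st4 form is actually faster than vectorizing only the inner loop depends on the target, which is the cost model question raised in the comment.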