https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #3)
> If this is a cost model problem, it is a bad one.

It is almost definitely a cost model issue in the x86_64 backend, because when I tried aarch64 with -march=armv9-a+sve, we get vectorization of only the inner loop at both -O2 and -O3:
```
.L3:
        ldp     q29, q30, [x0]
        ld1r    {v31.2d}, [x1], 8
        fmul    v30.2d, v30.2d, v31.2d
        fmul    v29.2d, v29.2d, v31.2d
        stp     q29, q30, [x0], 32
        cmp     x2, x1
        bne     .L3
```
With the default generic armv8-a cost model, we get the ld4 there and the outer loop is vectorized instead.
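
For readers following the assembly, here is a rough C sketch of a loop nest with the same shape as the `.L3` loop above: four contiguous doubles are loaded, multiplied by one broadcast scalar, and stored back on each outer iteration. This is reconstructed from the assembly only and is not the test case attached to the bug. The function name `scale_rows`, the parameter names, and the use of `restrict` are assumptions made for illustration, and whether GCC emits exactly that code for it depends on the version, target, and flags.

```c
#include <stddef.h>

/* Hypothetical reconstruction: scale each group of 4 doubles in a[]
   by the matching scalar in b[]. The inner j loop has a fixed trip
   count of 4, so it can be covered by two 128-bit vector multiplies
   (inner-loop vectorization). Vectorizing the outer i loop instead
   would need de-interleaving loads such as ld4. */
void scale_rows(double *restrict a, const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)        /* outer loop: one scalar per group */
        for (size_t j = 0; j < 4; j++)    /* inner loop: 4 contiguous doubles */
            a[4 * i + j] *= b[i];
}
```

Compiling a function like this at -O2 and -O3 with different -march/-mtune settings is one way to compare which loop each cost model chooses to vectorize.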