https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #3)
> If this is a cost model problem, it is a bad one.
It is almost definitely a cost model issue in the x86_64 backend, because I
tried on aarch64 with -march=armv9-a+sve and there we get only the vectorization
of the inner loop for both -O2 and -O3:
```
.L3:
ldp q29, q30, [x0]
ld1r {v31.2d}, [x1], 8
fmul v30.2d, v30.2d, v31.2d
fmul v29.2d, v29.2d, v31.2d
stp q29, q30, [x0], 32
cmp x2, x1
bne .L3
```
With the default generic armv8-a cost model we instead get an ld4 there, and
the outer loop is vectorized.
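
For reference, the loop shape implied by the assembly above can be sketched in C
roughly as follows. This is a reconstruction from the ldp/ld1r/fmul/stp
sequence, not the test case attached to the bug; the function name `scale_rows`
and the parameter names are made up for illustration.
```c
#include <stddef.h>

/* Hypothetical reconstruction inferred from the aarch64 loop above;
   the names and the exact signature are assumptions. */
void scale_rows(double *restrict a, const double *restrict b, size_t n)
{
    /* Outer loop: one scalar b[i] per iteration (the ld1r broadcast). */
    for (size_t i = 0; i < n; i++) {
        /* Inner loop: 4 contiguous doubles, i.e. two 2 x double vectors
           (the ldp/stp of q29 and q30). */
        for (size_t j = 0; j < 4; j++)
            a[4 * i + j] *= b[i];
    }
}
```
Vectorizing the inner loop keeps the four doubles contiguous and only needs a
broadcast of b[i]. Vectorizing the outer loop instead would have to
de-interleave groups of four elements of a, which is presumably where the ld4
comes from.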