https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #3)
> If this is a cost model problem, it is a bad one.

It is almost certainly a cost model issue in the x86_64 backend. I tried on
aarch64 with -march=armv9-a+sve, and there only the inner loop is vectorized
at both -O2 and -O3:
```
.L3:
        ldp     q29, q30, [x0]
        ld1r    {v31.2d}, [x1], 8
        fmul    v30.2d, v30.2d, v31.2d
        fmul    v29.2d, v29.2d, v31.2d
        stp     q29, q30, [x0], 32
        cmp     x2, x1
        bne     .L3
```

With the default generic armv8-a cost model, we get the ld4 there and the
outer loop is vectorized instead.
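
For readers following along, here is a rough C sketch of what the aarch64 loop
above appears to compute. It is only a reconstruction from the assembly, not
the testcase attached to this bug; the names scale_rows, a, b and n are
made up for illustration. Each outer iteration loads one scale factor (the
ld1r) and multiplies four contiguous doubles by it (the ldp/fmul/stp group).
Vectorizing the outer loop would instead process several values of i at once,
which presumably is why ld4 shows up with the generic cost model, to
de-interleave the groups of four.

```c
#include <stddef.h>

/* Hypothetical reconstruction: all names here are assumptions,
   not taken from the bug report. */
void scale_rows(double *a, const double *b, size_t n)
{
    /* Outer loop: one scale factor per iteration (ld1r {v31.2d}, [x1], 8). */
    for (size_t i = 0; i < n; i++) {
        double c = b[i];
        /* Inner loop: four contiguous doubles, i.e. two 128-bit vectors
           (ldp q29, q30 / two fmul / stp with a 32-byte post-increment). */
        for (size_t j = 0; j < 4; j++)
            a[4 * i + j] *= c;
    }
}
```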
