https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102977

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Huh.
The trunk code is vectorized all the way:
        ptrue   p1.h, vl8 ; set p1.h to 8 wide
        ptrue   p0.b, all ; set p0.b to all ones
        ld2h    {z2.h - z3.h}, p1/z, [x1] ; load the 8x2 vector into z2/z3
        ld2h    {z0.h - z1.h}, p1/z, [x2] ; load the 8x2 vector into z0/z1
        ld2h    {z16.h - z17.h}, p1/z, [x0] ; load the 8x2 vector into z16/17
        fmul    z6.h, z0.h, z3.h ; z6 = z0 * z3
        movprfx z7, z16          ; z7 = z16
        fmla    z7.h, p0/m, z0.h, z2.h ; z7+=z0*z2
        fmla    z6.h, p0/m, z1.h, z2.h ; z6 += z1*z2
        movprfx z4, z7                 ; z4 = z7
        fmls    z4.h, p0/m, z1.h, z3.h ; z4 -= z1*z3
        fadd    z5.h, z6.h, z17.h      ; z5 = z6 + z17
        st2h    {z4.h - z5.h}, p1, [x0] ; store the 8x2 vector into x0
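
Read as scalar code, the sequence above looks like a complex
multiply-accumulate over 8 (real, imaginary) pairs. The following C sketch
follows the register comments only; the original test case is not quoted in
this comment, so the function name cmla_model, the parameter names, and the
use of float instead of a half-float type are all assumptions made for
illustration.

```c
#include <stddef.h>

/* Scalar model of the vector sequence (hypothetical helper, not from the PR).
   c corresponds to [x0] (z16/z17), b to [x1] (z2/z3), a to [x2] (z0/z1).
   Each array holds interleaved pairs: element 2*i and element 2*i+1.
   float stands in for the half-float type, as an assumption for portability. */
static void cmla_model(float *c, const float *b, const float *a, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        float a0 = a[2 * i], a1 = a[2 * i + 1];   /* z0, z1 */
        float b0 = b[2 * i], b1 = b[2 * i + 1];   /* z2, z3 */
        float c0 = c[2 * i], c1 = c[2 * i + 1];   /* z16, z17 */

        float t7 = c0 + a0 * b0;   /* movprfx z7, z16 ; fmla z7 += z0*z2 */
        float t4 = t7 - a1 * b1;   /* movprfx z4, z7  ; fmls z4 -= z1*z3 */
        float t6 = a0 * b1;        /* fmul z6 = z0*z3 */
        t6 += a1 * b0;             /* fmla z6 += z1*z2 */
        float t5 = t6 + c1;        /* fadd z5 = z6 + z17 */

        c[2 * i]     = t4;         /* st2h stores z4/z5 interleaved */
        c[2 * i + 1] = t5;
    }
}
```

Calling cmla_model with pairs set to 8 would cover the same number of
elements as the vl8 predicate in the assembly.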


Note the way ld2 works: the first element goes into the first vector, the
second element goes into the second vector, the 3rd element goes into the
first vector, the 4th element goes into the second vector, and so on.
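
The de-interleaving described above can be sketched in scalar C as follows.
This is only a model of the element placement, not of the instruction itself;
the name ld2_model and the float element type are assumptions for
illustration.

```c
#include <stddef.h>

/* Scalar model of a two-register structure load (hypothetical helper):
   element 2*i of the interleaved source goes to first[i],
   element 2*i+1 goes to second[i]. */
static void ld2_model(const float *src, size_t pairs,
                      float *first, float *second)
{
    for (size_t i = 0; i < pairs; i++) {
        first[i]  = src[2 * i];       /* even-indexed elements */
        second[i] = src[2 * i + 1];   /* odd-indexed elements */
    }
}
```

For a source of {1, 2, 3, 4, 5, 6} and 3 pairs, first receives {1, 3, 5} and
second receives {2, 4, 6}.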

So this is optimized all the way. The lower limit of the SVE vector size is
128 bits (or 8 half floats), so 8 half floats will always fit into one vector
just fine.
So this is vectorized all the way, to the point that it is even fully unrolled.
