https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102977
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Huh. The trunk code is vectorized all the way:

        ptrue   p1.h, vl8                      ; set p1.h to 8 wide
        ptrue   p0.b, all                      ; set p0.b to all ones
        ld2h    {z2.h - z3.h}, p1/z, [x1]      ; load the 8x2 vector into z2/z3
        ld2h    {z0.h - z1.h}, p1/z, [x2]      ; load the 8x2 vector into z0/z1
        ld2h    {z16.h - z17.h}, p1/z, [x0]    ; load the 8x2 vector into z16/z17
        fmul    z6.h, z0.h, z3.h               ; z6 = z0 * z3
        movprfx z7, z16                        ; z7 = z16
        fmla    z7.h, p0/m, z0.h, z2.h         ; z7 += z0 * z2
        fmla    z6.h, p0/m, z1.h, z2.h         ; z6 += z1 * z2
        movprfx z4, z7                         ; z4 = z7
        fmls    z4.h, p0/m, z1.h, z3.h         ; z4 -= z1 * z3
        fadd    z5.h, z6.h, z17.h              ; z5 = z6 + z17
        st2h    {z4.h - z5.h}, p1, [x0]        ; store the 8x2 vector to [x0]

Note that ld2 de-interleaves: the first element goes into the first vector,
the second element into the second vector, the third element into the first
vector, the fourth element into the second vector, and so on.

So this is optimized all the way. The lower limit on the SVE vector size is
128 bits (8 half floats), so 8 half floats will always fit into one vector
just fine. The code is therefore vectorized all the way, to the point that
the loop is even fully unrolled.
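
For readers who do not have the SVE mnemonics at hand, here is a rough scalar
C model of what the listing above appears to compute. It is only a sketch
under several assumptions: the function names (ld2_model, st2_model,
cmla_model) and the pointer names a, b, c (standing for x0, x1, x2) are made
up for illustration, the original test case from the report is not reproduced
here, and plain float is used instead of half precision, so rounding can
differ from the real fused fmla/fmls instructions. Read that way, the
sequence looks like a complex multiply-accumulate, a += b * c, over 8
interleaved re/im pairs.

```c
#include <stddef.h>

#define LANES 8 /* matches the "vl8" predicate in the listing */

/* Model of ld2h: de-interleave pairs from memory into two lane arrays.
 * Even-indexed elements go to `first`, odd-indexed elements to `second`. */
void ld2_model(const float *mem, float *first, float *second)
{
    for (size_t i = 0; i < LANES; i++) {
        first[i]  = mem[2 * i];
        second[i] = mem[2 * i + 1];
    }
}

/* Model of st2h: re-interleave two lane arrays back into memory. */
void st2_model(float *mem, const float *first, const float *second)
{
    for (size_t i = 0; i < LANES; i++) {
        mem[2 * i]     = first[i];
        mem[2 * i + 1] = second[i];
    }
}

/* Hypothetical scalar equivalent of the listing: a += b * c (complex),
 * with a = [x0], b = [x1], c = [x2]. Local names mirror the registers. */
void cmla_model(float *a, const float *b, const float *c)
{
    float z0[LANES], z1[LANES], z2[LANES], z3[LANES];
    float z16[LANES], z17[LANES], z4[LANES], z5[LANES];

    ld2_model(b, z2, z3);   /* ld2h {z2, z3},   [x1] */
    ld2_model(c, z0, z1);   /* ld2h {z0, z1},   [x2] */
    ld2_model(a, z16, z17); /* ld2h {z16, z17}, [x0] */

    for (size_t i = 0; i < LANES; i++) {
        float z6 = z0[i] * z3[i];          /* fmul z6 = z0 * z3       */
        float z7 = z16[i] + z0[i] * z2[i]; /* movprfx + fmla z7       */
        z6 += z1[i] * z2[i];               /* fmla z6 += z1 * z2      */
        z4[i] = z7 - z1[i] * z3[i];        /* movprfx + fmls z4       */
        z5[i] = z6 + z17[i];               /* fadd z5 = z6 + z17      */
    }

    st2_model(a, z4, z5);   /* st2h {z4, z5}, [x0] */
}
```

Calling cmla_model on three arrays of 16 floats each updates the first array
in place; there is no loop counter or branch in the model either, which
mirrors the fully unrolled shape of the generated code.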