https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |12.1.0

--- Comment #12 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Starting in GCC 12 we get on arm64 (with -Ofast):
```
mult_su3_na:
        ldp     q3, q1, [x1, 16]
        ldr     q0, [x0, 32]
        ldp     q2, q4, [x0]
        fmul    v0.2d, v0.2d, v1.2d
        ldr     q1, [x1]
        fmla    v0.2d, v4.2d, v3.2d
        fmla    v0.2d, v2.2d, v1.2d
        faddp   d0, v0.2d
        ret
```

This is even better than before (and similarly on x86_64 with -mfma), due to SLP
happening.

With -fno-tree-vectorize, -Ofast on x86_64 is slightly better than 13, by one
instruction.

I am not sure if this matters any more due to the SLP improvement ...
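
For readers following the arm64 listing above, here is a rough scalar C sketch of what that instruction sequence appears to compute: a six-element dot product accumulated in two lanes and then summed pairwise. It is a reading of the assembly only, not the `mult_su3_na` source from the test case. The function name `dot6_two_lanes` and the plain `double[6]` parameters are assumptions made for illustration.

```c
/* Hypothetical illustration of the arm64 listing; not the original test case.
 * x0 is assumed to point at a[0..5] and x1 at b[0..5], both doubles. */
double dot6_two_lanes(const double a[6], const double b[6])
{
    /* fmul v0.2d: both lanes start from the products of elements 4 and 5. */
    double lane0 = a[4] * b[4];
    double lane1 = a[5] * b[5];

    /* First fmla: accumulate the products of elements 2 and 3. */
    lane0 += a[2] * b[2];
    lane1 += a[3] * b[3];

    /* Second fmla: accumulate the products of elements 0 and 1. */
    lane0 += a[0] * b[0];
    lane1 += a[1] * b[1];

    /* faddp d0, v0.2d: pairwise add of the two lanes gives the scalar result. */
    return lane0 + lane1;
}
```

The two independent accumulators are what let the vectorizer map each multiply-add pair onto a single `fmla`. Summing in this order differs from a strict left-to-right scalar sum, which is why it needs the reassociation that -Ofast permits.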