https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Priority|P3                          |P2
             Target|                            |x86_64-*-*
   Last reconfirmed|                            |2021-08-16

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.

void foo(float * restrict a, float b, float *c) {
    a[0] = c[0]*b + a[0];
    a[1] = c[2]*b + a[1];
    a[2] = c[1]*b + a[2];
    a[3] = c[3]*b + a[3];
}

shows the issue on x86_64 with a lack of an FMA.  One complication is that FMA
forming is done in a later pass only thus the vectorizer has no guidance to
decide on placement of the permute.  Note the vectorizer permute optimization
propagates in one direction only (but the optimistic pieces), the intent is to
reduce the number of permutes which almost exclusively come from loads.
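
For readers following along, here is a scalar sketch of the permute
placement question for the testcase above.  It is only an illustration
under stated assumptions, not what the vectorizer emits: the names
foo_permute_load, foo_permute_acc and perm are hypothetical, and the
4-element temporaries stand in for vector registers.  Variant A permutes
the loaded c lanes once; variant B leaves c in memory order and instead
permutes a on load and the result before the store, which costs two
permutes.

```c
/* Lane order in which foo reads c: c[0], c[2], c[1], c[3].
   This permutation swaps lanes 1 and 2, so it is its own inverse. */
const int perm[4] = {0, 2, 1, 3};

/* Reference: the testcase from the comment, unchanged. */
void foo(float * restrict a, float b, float *c) {
    a[0] = c[0]*b + a[0];
    a[1] = c[2]*b + a[1];
    a[2] = c[1]*b + a[2];
    a[3] = c[3]*b + a[3];
}

/* Variant A (hypothetical): one permute, placed on the load of c.
   The multiply-add is then lane-wise on a, so it has the a*b + c
   shape that a later pass could turn into an FMA. */
void foo_permute_load(float * restrict a, float b, const float *c) {
    float vc[4];
    for (int i = 0; i < 4; i++)
        vc[i] = c[perm[i]];          /* permute after loading c */
    for (int i = 0; i < 4; i++)
        a[i] = vc[i]*b + a[i];       /* lane-wise multiply-add */
}

/* Variant B (hypothetical): c stays in memory order; a is permuted
   on load and the result is permuted back before the store.
   Relies on perm being self-inverse. */
void foo_permute_acc(float * restrict a, float b, const float *c) {
    float va[4], vr[4];
    for (int i = 0; i < 4; i++)
        va[i] = a[perm[i]];          /* first permute: load of a */
    for (int i = 0; i < 4; i++)
        vr[i] = c[i]*b + va[i];      /* lane-wise multiply-add */
    for (int i = 0; i < 4; i++)
        a[perm[i]] = vr[i];          /* second permute: store of a */
}
```

All three functions compute the same values; they differ only in where the
lane shuffle sits relative to the multiply-add, which is the decision the
comment says the vectorizer has to make without knowing an FMA will be
formed later.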