https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Or

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
    }
}

which early folding breaks unless we add -fno-associative-math.  We then end up with

  a[_1] = (((b[_1]) - (a[_1])) + 1.0e+0);
  a[_6] = (((a[_6]) - (b[_6])) + 1.0e+0);

where SLP operator swapping cannot bring the grouped loads into the same lanes.

So the idea is to look at single-use chains of plus/minus operations and handle those as wide associated SLP nodes with flags denoting which lanes need negation.  For this example we'd have three children, and each child has a per-lane spec saying whether to add or subtract.
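
To make the per-lane spec concrete, here is a rough scalar model of what such a node would compute for one iteration of the loop above. It is only a sketch: struct chain_child, eval_chain and foo_pair_chain are hypothetical names made up for illustration, not GCC's SLP data structures or any existing vectorizer API. The point is that all three children (the constant, the a loads, the b loads) keep the same lane order, and only the negate flags differ between lanes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 2 /* a[2*i] and a[2*i+1] form one two-lane group */

/* Hypothetical model of one child of an associated plus/minus chain.
   This is not GCC's SLP node representation.  */
struct chain_child
{
  double lane[LANES];  /* operand value in each lane */
  bool negate[LANES];  /* true: subtract in that lane, false: add */
};

/* Return the operand of child C in lane L with its sign applied.  */
static double
signed_operand (const struct chain_child *c, size_t l)
{
  return c->negate[l] ? -c->lane[l] : c->lane[l];
}

/* Sum the children left to right, applying the per-lane sign.  */
static void
eval_chain (const struct chain_child *children, size_t n, double out[LANES])
{
  assert (n > 0);
  for (size_t l = 0; l < LANES; ++l)
    {
      double acc = signed_operand (&children[0], l);
      for (size_t c = 1; c < n; ++c)
        acc += signed_operand (&children[c], l);
      out[l] = acc;
    }
}

/* One iteration of foo: A and B hold the two adjacent elements.  */
static void
foo_pair_chain (const double a[LANES], const double b[LANES],
                double out[LANES])
{
  struct chain_child children[3] = {
    { { 1.0, 1.0 },   { false, false } },  /* constant: +, + */
    { { a[0], a[1] }, { true, false } },   /* a loads:  -, + */
    { { b[0], b[1] }, { false, true } },   /* b loads:  +, - */
  };
  eval_chain (children, 3, out);
}
```

With the children ordered this way, lane 0 computes 1 - a + b and lane 1 computes 1 + a - b, which matches the two source statements without having to swap operands of individual binary operations.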