https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101340
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Not sure if this is exactly the same issue (I can file a separate PR
if it's not), but there's a similar inefficiency in
gcc.dg/vect/pr97832-2.c.  There we unroll:

#pragma GCC unroll 4
    for (int k = 0; k < 4; ++k) {
      double x_re = x[c+0+k];
      double x_im = x[c+4+k];
      double y_re = y[c+0+k];
      double y_im = y[c+4+k];
      y_re = y_re - x_re * f_re - x_im * f_im;;
      y_im = y_im + x_re * f_im - x_im * f_re;
      y[c+0+k] = y_re;
      y[c+4+k] = y_im;
    }

The depths of the y_re and x_re calculations for k==0 are one less
than for k>0, due to the extra c+N additions for the latter.
k==0 therefore gets a lower reassociation rank, so we end up with:

  _65 = f_re_34 * x_re_54;
  _66 = y_re_62 - _65;
  _67 = f_im_35 * x_im_60;
  y_re_68 = _66 - _67;

for k==0 but:

  _93 = f_re_34 * x_re_82;
  _95 = f_im_35 * x_im_88;
  _41 = _93 + _95;
  y_re_96 = y_re_90 - _41;

etc. for k>0.

This persists into the SLP code, where we use the following load permutes:

  load permutation { 4 1 2 3 0 1 2 3 }
  load permutation { 0 5 6 7 4 5 6 7 }

With different reassociation we could have used:

  load permutation { 0 1 2 3 0 1 2 3 }
  load permutation { 4 5 6 7 4 5 6 7 }

instead.
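To make the depth difference easier to see, here is a hand-written C sketch
of the k==0 and k==1 bodies after unrolling (my own illustration, not code
from the testcase; the function name, the use of a separate helper, and the
explicit parenthesisation are assumptions).  The parentheses show the two
association orders that reassoc ends up picking:

/* Sketch only: assumes x, y, c, f_re, f_im as in the loop quoted above.  */
static void
unrolled_sketch (double *x, double *y, int c, double f_re, double f_im)
{
  /* k == 0: c+0+0 folds to plain c, so the loads sit one addition
     shallower and the update keeps the serial, left-to-right form.  */
  double x_re0 = x[c];
  double x_im0 = x[c + 4];
  double y_re0 = y[c];
  y[c] = (y_re0 - x_re0 * f_re) - x_im0 * f_im;

  /* k == 1 (and likewise k==2, k==3): the extra c+1 / c+5 additions
     give the operands a higher rank, so the two products are summed
     first and subtracted in one go.  */
  double x_re1 = x[c + 1];
  double x_im1 = x[c + 5];
  double y_re1 = y[c + 1];
  y[c + 1] = y_re1 - (x_re1 * f_re + x_im1 * f_im);
}

As I read the permutes above, if all four lanes used the second form, each
SLP load node would simply repeat one contiguous half of the group
({ 0 1 2 3 0 1 2 3 } for the real parts, { 4 5 6 7 4 5 6 7 } for the
imaginary parts) instead of pulling one lane from the other half as the
current { 4 1 2 3 0 1 2 3 } / { 0 5 6 7 4 5 6 7 } permutes do.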