https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101340

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Not sure if this is exactly the same issue (I can file a separate
PR if it's not), but there's a similar inefficiency in
gcc.dg/vect/pr97832-2.c.  There we unroll:

#pragma GCC unroll 4
    for (int k = 0; k < 4; ++k) {
      double x_re = x[c+0+k];
      double x_im = x[c+4+k];
      double y_re = y[c+0+k];
      double y_im = y[c+4+k];
      y_re = y_re - x_re * f_re - x_im * f_im;
      y_im = y_im + x_re * f_im - x_im * f_re;
      y[c+0+k] = y_re;
      y[c+4+k] = y_im;
    }
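
For illustration, here is a hand-written sketch of the k==0 and k==1
iterations after unrolling (not the compiler's actual expansion; k==2
and k==3 follow the same pattern as k==1).  The k==0 accesses only
need c itself, whereas every k>0 access needs an extra c+0+k or c+4+k
addition, which is where the depth difference described below comes from:

  /* k == 0: the indices are just c and c+4, so each load's address
     computation is one addition shallower.  */
  double x_re0 = x[c+0];
  double x_im0 = x[c+4];
  double y_re0 = y[c+0];
  double y_im0 = y[c+4];
  y[c+0] = y_re0 - x_re0 * f_re - x_im0 * f_im;
  y[c+4] = y_im0 + x_re0 * f_im - x_im0 * f_re;

  /* k == 1: the c+0+1 and c+4+1 additions add one level to each
     load's computation.  */
  double x_re1 = x[c+0+1];
  double x_im1 = x[c+4+1];
  double y_re1 = y[c+0+1];
  double y_im1 = y[c+4+1];
  y[c+0+1] = y_re1 - x_re1 * f_re - x_im1 * f_im;
  y[c+4+1] = y_im1 + x_re1 * f_im - x_im1 * f_re;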

The depths of the y_re and x_re calculations for k==0 are one
less than for k>0, due to the extra c+N additions for the latter.
k==0 therefore gets a lower reassociation rank, so we end
up with:

  _65 = f_re_34 * x_re_54;
  _66 = y_re_62 - _65;
  _67 = f_im_35 * x_im_60;
  y_re_68 = _66 - _67;

for k==0 but:

  _93 = f_re_34 * x_re_82;
  _95 = f_im_35 * x_im_88;
  _41 = _93 + _95;
  y_re_96 = y_re_90 - _41;

etc. for k>0.  This persists into the SLP code, where we use
the following load permutations:

  load permutation { 4 1 2 3 0 1 2 3 }
  load permutation { 0 5 6 7 4 5 6 7 }

With different reassociation we could have used:

  load permutation { 0 1 2 3 0 1 2 3 }
  load permutation { 4 5 6 7 4 5 6 7 }

instead.
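
In source terms (just restating the GIMPLE above), the two shapes are:

  /* k == 0: associates left to right, subtracting each product in turn.  */
  y_re = (y_re - f_re * x_re) - f_im * x_im;

  /* k > 0: sums the two products first, then subtracts the sum.  */
  y_re = y_re - (f_re * x_re + f_im * x_im);

Reassociating so that all four k values use the same shape is what
would allow the uniform permutations above.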
