[Bug tree-optimization/79336] Poor vectorisation of additive reduction of complex array, final SLP reduction step inefficient

rguenth at gcc dot gnu.org Thu, 02 Feb 2017 04:06:37 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-02-02
          Component|c                           |tree-optimization
             Blocks|                            |53947
            Summary|Poor vectorisation of       |Poor vectorisation of
                   |additive reduction of       |additive reduction of
                   |complex array               |complex array, final SLP
                   |                            |reduction step inefficient
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The reduction loop itself is fine; it is the final reduction step
involving the SLP reduction result (we reduce two scalars) that is handled
less than optimally:

  <bb 3> [96.97%]:
  # i_16 = PHI <i_11(4), 0(2)>
  # p$real_13 = PHI <_17(4), 1.0e+0(2)>
  # p$imag_14 = PHI <_18(4), 0.0(2)>
  # ivtmp_34 = PHI <ivtmp_33(4), 32(2)>
  _1 = (long unsigned int) i_16;
  _2 = _1 * 8;
  _3 = x_9(D) + _2;
  _7 = REALPART_EXPR <*_3>;
  _12 = IMAGPART_EXPR <*_3>;
  _17 = _7 + p$real_13;
  _18 = _12 + p$imag_14;
  i_11 = i_16 + 1;
  ivtmp_33 = ivtmp_34 - 1;
  if (ivtmp_33 != 0)
    goto <bb 4>; [96.88%]
  else
    goto <bb 5>; [3.12%]

  <bb 4> [93.94%]:
  goto <bb 3>; [100.00%]

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  p_10 = COMPLEX_EXPR <_36, _35>;

Here we simply first produce _36 and _35 from the vectorized reduction
result and only then build the complex function result:

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  # vect__17.8_22 = PHI <vect__17.8_24(3)>
  stmp__17.9_21 = BIT_FIELD_REF <vect__17.8_22, 32, 0>;
  stmp__17.9_20 = BIT_FIELD_REF <vect__17.8_22, 32, 32>;
  stmp__17.9_19 = BIT_FIELD_REF <vect__17.8_22, 32, 64>;
  stmp__17.9_15 = BIT_FIELD_REF <vect__17.8_22, 32, 96>;
  stmp__17.9_6 = BIT_FIELD_REF <vect__17.8_22, 32, 128>;
  stmp__17.9_5 = BIT_FIELD_REF <vect__17.8_22, 32, 160>;
  stmp__17.9_4 = BIT_FIELD_REF <vect__17.8_22, 32, 192>;
  stmp__17.9_29 = BIT_FIELD_REF <vect__17.8_22, 32, 224>;
  stmp__17.9_28 = stmp__17.9_21 + stmp__17.9_19;
  stmp__17.9_27 = stmp__17.9_20 + stmp__17.9_15;
  stmp__17.9_26 = stmp__17.9_28 + stmp__17.9_6;
  stmp__17.9_37 = stmp__17.9_27 + stmp__17.9_5;
  stmp__17.9_38 = stmp__17.9_26 + stmp__17.9_4;
  stmp__17.9_39 = stmp__17.9_37 + stmp__17.9_29;
  p_10 = COMPLEX_EXPR <stmp__17.9_38, stmp__17.9_39>;
  return p_10;

This doesn't take advantage of the fact that we can do this kind of
final SLP reduction more efficiently (I didn't try to decipher exactly
what ICC does here).  It may require ABI details or knowing that
we can type-pun a vector to a complex...  (but only for complex float;
for complex double the ABI doesn't work out this way!)
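
To illustrate the difference, here is a minimal C sketch that models the
eight-lane accumulator as a plain float array.  It is not the original test
case and not what the vectorizer or ICC emits; the names (sum_scalar,
accumulate, epilogue_chain, epilogue_halving, demo) and the input values are
hypothetical.  epilogue_chain mirrors the lane-by-lane epilogue from the dump
above, while epilogue_halving folds the upper half of the lanes onto the lower
half until one (real, imag) pair is left, which would be two vector adds
instead of eight extracts and six scalar adds.

```c
#include <assert.h>
#include <complex.h>

enum { N = 32, LANES = 8 };   /* 32 complex floats, 8 float lanes (assumed) */

/* Scalar reference over interleaved re,im data; the accumulator starts at
   1.0 + 0.0i as in the PHI nodes of the dump.  */
static float complex sum_scalar(const float *xs)
{
    float re = 1.0f, im = 0.0f;
    for (int i = 0; i < N; i++) {
        re += xs[2 * i];
        im += xs[2 * i + 1];
    }
    return re + im * I;
}

/* Model of the vectorized loop: lanes hold re,im,re,im,... partial sums.  */
static void accumulate(const float *xs, float acc[LANES])
{
    for (int l = 0; l < LANES; l++)
        acc[l] = 0.0f;
    acc[0] = 1.0f;                        /* initial value in the first pair */
    for (int i = 0; i < 2 * N; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += xs[i + l];
}

/* Epilogue shaped like the dump: extract every lane, then two scalar chains. */
static float complex epilogue_chain(const float v[LANES])
{
    float re = v[0] + v[2];
    float im = v[1] + v[3];
    re += v[4];  im += v[5];
    re += v[6];  im += v[7];
    return re + im * I;
}

/* Alternative: fold the upper half onto the lower half until 2 lanes remain. */
static float complex epilogue_halving(const float v[LANES])
{
    float t[LANES];
    for (int l = 0; l < LANES; l++)
        t[l] = v[l];
    for (int width = LANES / 2; width >= 2; width /= 2)
        for (int l = 0; l < width; l++)   /* one vector add per step */
            t[l] += t[l + width];
    return t[0] + t[1] * I;
}

/* Checks that both epilogues agree with the scalar loop.  The inputs are
   exactly representable, so reassociation does not change the result.  */
static void demo(void)
{
    float xs[2 * N], acc[LANES];
    for (int i = 0; i < N; i++) {
        xs[2 * i] = (float)i;
        xs[2 * i + 1] = 0.5f * (float)i;
    }
    accumulate(xs, acc);
    assert(epilogue_chain(acc) == sum_scalar(xs));
    assert(epilogue_halving(acc) == sum_scalar(xs));
}
```

For general floating-point inputs the two epilogues associate the additions
differently, so they are only interchangeable when reassociation is already
permitted, which vectorizing the reduction presupposes anyway.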


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
