https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Aha, so the issue is that we have a vectorized epilogue here and the epilogue
of _that_ ends up doing


  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_28(3)>
  # sum1_47 = PHI <sum1_29(3)>
  # sum2_46 = PHI <sum2_30(3)>
  # sum3_45 = PHI <sum3_31(3)>
  # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
  # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
  # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
  # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
  _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
  _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
...
The above is from the main vectorized loop.

  <bb 17> [local count: 81467476]:
  # sum0_135 = PHI <sum0_62(12)>
  # sum1_136 = PHI <sum1_58(12)>
  # sum2_137 = PHI <sum2_54(12)>
  # sum3_138 = PHI <sum3_50(12)>
  # vect_sum3_50.25_159 = PHI <vect_sum3_50.25_158(12)>
The above is from the epilogue.

  <bb 15> [local count: 105119324]:
  # sum0_15 = PHI <sum0_135(17), _129(11)>
  # sum1_77 = PHI <sum1_136(17), _130(11)>
  # sum2_75 = PHI <sum2_137(17), _131(11)>
  # sum3_13 = PHI <sum3_138(17), _132(11)>
  # _160 = PHI <vect_sum3_50.25_159(17), vect_sum3_31.16_101(11)>
  _161 = BIT_FIELD_REF <_160, 32, 0>;
  _162 = BIT_FIELD_REF <_160, 32, 32>;
  _163 = BIT_FIELD_REF <_160, 32, 64>;
  _164 = BIT_FIELD_REF <_160, 32, 96>;
  _74 = _161 + _162;
  _76 = _74 + _163;
  _78 = _76 + _164;

so we fail to accumulate the main loop's accumulators and just
use the first one.  On x86 the vectorized epilogue uses a smaller
vector size but the same number of accumulators.
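
To make the effect concrete, here is a minimal scalar model of the join, not
GCC code.  It assumes four vector accumulators of four lanes each (matching
the four vect_sum3_31.16_* PHIs in bb 11); the names main_loop, reduce_all and
reduce_first_only are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

#define LANES 4  /* lanes per vector accumulator (assumed for the model) */
#define NACC  4  /* accumulators kept by the unrolled main loop (assumed) */

/* Model of the unrolled main loop: each group of LANES elements is added
   to the next accumulator in round-robin order.  */
void main_loop(const int *a, size_t n, int acc[NACC][LANES])
{
    for (size_t k = 0; k < NACC; k++)
        for (size_t l = 0; l < LANES; l++)
            acc[k][l] = 0;

    for (size_t i = 0; i < n; i++)
        acc[(i / LANES) % NACC][i % LANES] += a[i];
}

/* Expected join: fold all accumulators together, then sum the lanes.  */
int reduce_all(int acc[NACC][LANES])
{
    int v[LANES] = { 0 };
    for (size_t k = 0; k < NACC; k++)
        for (size_t l = 0; l < LANES; l++)
            v[l] += acc[k][l];

    int sum = 0;
    for (size_t l = 0; l < LANES; l++)
        sum += v[l];
    return sum;
}

/* Shape of the join in bb 15: only the first accumulator is forwarded,
   so the contributions of the other NACC - 1 accumulators are lost.  */
int reduce_first_only(int acc[NACC][LANES])
{
    int sum = 0;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[0][l];
    return sum;
}
```

With 32 input elements all equal to 1, reduce_all yields 32 while
reduce_first_only yields 8, i.e. only a quarter of the main loop's work
survives the join in this model.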

It seems it's simply unexpected to have the unrolled SLP reduction
and a vectorized epilogue with the same vector mode (but not unrolled).

I can reproduce the failure when patching the x86 cost model to force
unrolling by 2 (maybe we want a --param to force that to aid debugging...).
