https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Aha, so the issue is that we have a vectorized epilogue here and the epilogue
of _that_ ends up doing

  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_28(3)>
  # sum1_47 = PHI <sum1_29(3)>
  # sum2_46 = PHI <sum2_30(3)>
  # sum3_45 = PHI <sum3_31(3)>
  # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
  # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
  # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
  # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
  _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
  _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
  ...
this is from the main vect

  <bb 17> [local count: 81467476]:
  # sum0_135 = PHI <sum0_62(12)>
  # sum1_136 = PHI <sum1_58(12)>
  # sum2_137 = PHI <sum2_54(12)>
  # sum3_138 = PHI <sum3_50(12)>
  # vect_sum3_50.25_159 = PHI <vect_sum3_50.25_158(12)>
this is from the epilogue

  <bb 15> [local count: 105119324]:
  # sum0_15 = PHI <sum0_135(17), _129(11)>
  # sum1_77 = PHI <sum1_136(17), _130(11)>
  # sum2_75 = PHI <sum2_137(17), _131(11)>
  # sum3_13 = PHI <sum3_138(17), _132(11)>
  # _160 = PHI <vect_sum3_50.25_159(17), vect_sum3_31.16_101(11)>
  _161 = BIT_FIELD_REF <_160, 32, 0>;
  _162 = BIT_FIELD_REF <_160, 32, 32>;
  _163 = BIT_FIELD_REF <_160, 32, 64>;
  _164 = BIT_FIELD_REF <_160, 32, 96>;
  _74 = _161 + _162;
  _76 = _74 + _163;
  _78 = _76 + _164;

so we fail to accumulate the main loop's accumulators and just use the first
one. On x86 the vectorized epilogue uses a smaller vector size but the same
number of accumulators. It seems it's simply unexpected to have the unrolled
SLP reduction and a vectorized epilogue with the same vector mode (but not
unrolled).

I can reproduce the failure when patching the x86 cost model to force
unrolling by 2 (maybe we want a --param to force that to aid debugging...).
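
For readers following along, the failure mode in the dump can be modelled in plain C. This is only a rough sketch under stated assumptions, not GCC code and not the testcase from this PR: the names vec4, main_loop, merge_correct and merge_first_only are made up for illustration, the four-lane/four-accumulator shape is read off the <bb 11> PHIs, and unsigned 32-bit lanes are assumed. merge_first_only mimics what the PHI for _160 does on the edge from <bb 11> (only vect_sum3_31.16_101 reaches the final lane reduction), while merge_correct adds up all four accumulators first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* four 32-bit lanes, as in BIT_FIELD_REF <_160, 32, 0..96> */
#define NACC  4 /* four vector accumulators, as in the <bb 11> PHIs */

/* Hypothetical stand-in for a 128-bit vector of 32-bit integers. */
typedef struct { uint32_t lane[LANES]; } vec4;

/* Lane-wise vector addition. */
static vec4 vec_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int l = 0; l < LANES; l++)
        r.lane[l] = a.lane[l] + b.lane[l];
    return r;
}

/* Horizontal reduction, mirroring _161 + _162 + _163 + _164. */
static uint32_t reduce_lanes(vec4 v)
{
    uint32_t s = 0;
    for (int l = 0; l < LANES; l++)
        s += v.lane[l];
    return s;
}

/* Model of an unrolled vector loop: each iteration feeds NACC
   independent accumulators.  n must be a multiple of LANES * NACC. */
static void main_loop(const uint32_t *a, size_t n, vec4 acc[NACC])
{
    assert(n % (LANES * NACC) == 0);
    for (int k = 0; k < NACC; k++)
        for (int l = 0; l < LANES; l++)
            acc[k].lane[l] = 0;
    for (size_t i = 0; i < n; i += LANES * NACC)
        for (int k = 0; k < NACC; k++)
            for (int l = 0; l < LANES; l++)
                acc[k].lane[l] += a[i + (size_t)k * LANES + l];
}

/* Expected behaviour: combine all accumulators, then reduce the lanes. */
static uint32_t merge_correct(const vec4 acc[NACC])
{
    vec4 total = acc[0];
    for (int k = 1; k < NACC; k++)
        total = vec_add(total, acc[k]);
    return reduce_lanes(total);
}

/* Behaviour seen in the dump: only the first accumulator is reduced,
   so the contributions of the other three are silently dropped. */
static uint32_t merge_first_only(const vec4 acc[NACC])
{
    return reduce_lanes(acc[0]);
}
```

With the input 1..32, merge_correct returns the full sum while merge_first_only only sees the elements that landed in the first accumulator, which is the kind of wrong result the PR describes.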