https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #7)
> One reduced test case is:
>
> ============================================================
>
> #include <stdio.h>
> #include <math.h>
>
> #define N 128
> float fl[N];
>
> __attribute__ ((noipa, optimize (0))) void
> init ()
> {
>   for (int i = 0; i < N; i++)
>     fl[i] = i;
> }
>
> __attribute__ ((noipa)) float
> foo (int n1)
> {
>   float sum0, sum1, sum2, sum3;
>   sum0 = sum1 = sum2 = sum3 = 0.0f;
>
>   int n = (n1 / 4) * 4;
>   for (int i = 0; i < n; i += 4)
>     {
>       sum0 += fabs (fl[i]);
>       sum1 += fabs (fl[i + 1]);
>       sum2 += fabs (fl[i + 2]);
>       sum3 += fabs (fl[i + 3]);
>     }
>
>   return sum0 + sum1 + sum2 + sum3;
> }
>
> __attribute__ ((optimize (0))) int
> main ()
> {
>   init ();
>   float res = foo (80);
>   __builtin_printf ("res:%f\n", res);
>   return 0;
> }
>
> ============================================================
> incorrect result "res:670.000000" vs expected result "res:3160.000000"
>
> It looks like it exposes a bug in vectorization reduction support. The
> reduction epilogue handling looks wrong; it generates gimple code like:
>
>   # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
>   # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
>   # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
>   # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
>   _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
>   _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
>   _107 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 64>;
>   _108 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 96>;
>   _109 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 0>;
>   _110 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 32>;
>   _111 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 64>;
>   _112 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 96>;
>   ...
>
> it doesn't consider that the reduced results vect_sum3_31.16_10{1,2,3,4} from
> the loop can be reduced again in the loop exit block, since they are in the
> same SLP group.

The above doesn't look wrong (but it may be missing the rest of the IL).  On
x86_64 this looks like

  <bb 4> [local count: 105119324]:
  # sum0_41 = PHI <sum0_28(3)>
  # sum1_39 = PHI <sum1_29(3)>
  # sum2_37 = PHI <sum2_30(3)>
  # sum3_35 = PHI <sum3_31(3)>
  # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
  _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
  _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
  _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
  _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
  _74 = _58 + _57;
  _76 = _56 + _74;
  _78 = _55 + _76;

  <bb 5> [local count: 118111600]:
  # prephitmp_79 = PHI <_78(4), 0.0(2)>
  return prephitmp_79;

When unrolling is applied, thus with a larger VF, you should ideally see the
vectors accumulated.

Btw, I fixed an SLP reduction issue two days ago in r13-3226-gee467644c53ee2,
though that looks unrelated?
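To spell out what the epilogue in bb 4 computes: it is a lane-wise horizontal
reduction of the single vector accumulator.  A rough C equivalent, written
with GCC's vector extensions and made-up names purely for illustration (this
is not actual vectorizer output):

  typedef float v4sf __attribute__ ((vector_size (16)));

  /* What the BIT_FIELD_REFs plus the chain of scalar adds in bb 4 do:
     extract the four lanes of the one vector accumulator and sum them.  */
  float
  reduce_one (v4sf acc)
  {
    return acc[0] + acc[1] + acc[2] + acc[3];
  }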
When I force a larger VF on x86 by adding an int store in the loop, I see

  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_29(3)>
  # sum1_36 = PHI <sum1_30(3)>
  # sum2_35 = PHI <sum2_31(3)>
  # sum3_24 = PHI <sum3_32(3)>
  # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
  # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
  # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
  # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
  _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
  _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
  _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
  _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
  _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
  _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
  _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
  _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
  _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
  _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
  _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
  _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
  _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
  _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
  _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
  _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
  _130 = _114 + _118;
  _131 = _115 + _119;
  _132 = _116 + _120;
  _133 = _117 + _121;
  _134 = _130 + _122;
  _135 = _131 + _123;
  _136 = _132 + _124;
  _137 = _133 + _125;
  _138 = _134 + _126;

See how the lanes from the different vectors are accumulated?  (Yeah, we
should simply add the vectors!)
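A sketch of the cheaper epilogue suggested here, again in C with GCC's vector
extensions and hypothetical names: add the four vector accumulators first and
reduce the lanes only once, instead of extracting and pairwise-adding all
sixteen lanes as above:

  typedef float v4sf __attribute__ ((vector_size (16)));

  float
  reduce_four (v4sf acc0, v4sf acc1, v4sf acc2, v4sf acc3)
  {
    /* "Simply add the vectors" ...  */
    v4sf sum = (acc0 + acc1) + (acc2 + acc3);
    /* ... then do a single lane-wise horizontal reduction.  */
    return sum[0] + sum[1] + sum[2] + sum[3];
  }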