https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #7)
> One reduced test case is:
> 
> ============================================================
> 
> #include <stdio.h>
> #include <math.h>
> 
> #define N 128
> float fl[N];
> 
> __attribute__ ((noipa, optimize (0))) void
> init ()
> {
>   for (int i = 0; i < N; i++)
>     fl[i] = i;
> }
> 
> __attribute__ ((noipa)) float
> foo (int n1)
> {
>   float sum0, sum1, sum2, sum3;
>   sum0 = sum1 = sum2 = sum3 = 0.0f;
> 
>   int n = (n1 / 4) * 4;
>   for (int i = 0; i < n; i += 4)
>     {
>       sum0 += fabs (fl[i]);
>       sum1 += fabs (fl[i + 1]);
>       sum2 += fabs (fl[i + 2]);
>       sum3 += fabs (fl[i + 3]);
>     }
> 
>   return sum0 + sum1 + sum2 + sum3;
> }
> 
> __attribute__ ((optimize (0))) int
> main ()
> {
>   init ();
>   float res = foo (80);
>   __builtin_printf ("res:%f\n", res);
>   return 0;
> }
> 
> ============================================================ 
> incorrect result "res:670.000000" vs expected result "res:3160.000000"
> 
> It looks like it exposes a bug in the vectorizer's reduction support. The
> reduction epilogue handling looks wrong; it generates gimple code like:
> 
>     # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
>     # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
>     # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
>     # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
>     _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
>     _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
>     _107 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 64>;
>     _108 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 96>;
>     _109 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 0>;
>     _110 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 32>;
>     _111 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 64>;
>     _112 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 96>;
> ...
> 
> It doesn't consider that the reduced results vect_sum3_31.16_10{1,2,3,4} from
> the loop can be reduced again in the loop exit block, as they are in the same
> SLP group.

The above doesn't look wrong (but may miss the rest of the IL).  On
x86_64 this looks like

  <bb 4> [local count: 105119324]:
  # sum0_41 = PHI <sum0_28(3)>
  # sum1_39 = PHI <sum1_29(3)>
  # sum2_37 = PHI <sum2_30(3)>
  # sum3_35 = PHI <sum3_31(3)>
  # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
  _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
  _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
  _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
  _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
  _74 = _58 + _57;
  _76 = _56 + _74;
  _78 = _55 + _76;

  <bb 5> [local count: 118111600]:
  # prephitmp_79 = PHI <_78(4), 0.0(2)>
  return prephitmp_79;

When unrolling is applied, and thus a larger VF is used, you should ideally
see the vectors accumulated.

Btw, I fixed an SLP reduction issue two days ago in r13-3226-gee467644c53ee2,
though that looks unrelated?

When I force a larger VF on x86 by adding an int store in the loop I see

  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_29(3)>
  # sum1_36 = PHI <sum1_30(3)>
  # sum2_35 = PHI <sum2_31(3)>
  # sum3_24 = PHI <sum3_32(3)>
  # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
  # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
  # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
  # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
  _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
  _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
  _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
  _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
  _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
  _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
  _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
  _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
  _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
  _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
  _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
  _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
  _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
  _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
  _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
  _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
  _130 = _114 + _118;
  _131 = _115 + _119;
  _132 = _116 + _120;
  _133 = _117 + _121;
  _134 = _130 + _122;
  _135 = _131 + _123;
  _136 = _132 + _124;
  _137 = _133 + _125;
  _138 = _134 + _126;

See how the lanes from the different vectors are accumulated?  (Yeah,
we should simply add the vectors!)
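
To make the difference concrete, here is a minimal sketch in C, assuming
GCC's generic vector extensions.  The names v4sf, NVECS, reduce_lanewise
and reduce_vectors_first are hypothetical and only illustrate the two
epilogue shapes; this is not vectorizer code.  The final cross-lane sum
mirrors the x86_64 dump earlier, since the larger-VF dump above stops
before that step.

```c
/* Four-lane float vector, via GCC's generic vector extensions.  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Number of partial-sum vectors live at the loop exit (assumed 4,
   matching the four vector PHIs in the dump).  */
#define NVECS 4

/* Epilogue shape seen in the dump: every lane is extracted first
   (the BIT_FIELD_REFs), then matching lanes are added as scalars.  */
float
reduce_lanewise (const v4sf acc[NVECS])
{
  float lane[4];
  for (int l = 0; l < 4; l++)
    {
      lane[l] = acc[0][l];
      for (int v = 1; v < NVECS; v++)
        lane[l] += acc[v][l];
    }
  return lane[0] + lane[1] + lane[2] + lane[3];
}

/* Alternative shape: add the whole vectors first, then extract the
   lanes of the single remaining vector.  */
float
reduce_vectors_first (const v4sf acc[NVECS])
{
  v4sf sum = acc[0];
  for (int v = 1; v < NVECS; v++)
    sum += acc[v];
  return sum[0] + sum[1] + sum[2] + sum[3];
}
```

Both functions perform the same additions on each lane in the same
order, so in this sketch they return identical results; the second
replaces twelve scalar adds with three vector adds and needs four lane
extracts instead of sixteen.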
