https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70138

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I've already spent some time on this last night.  It fails even when foo is not
inlined:

double u[1782225];

__attribute__((noinline, noclone)) static void
foo (int *x)
{
  double c = 0.0;
  int a, b;
  for (a = 0; a < 1335; a++)
    {
      for (b = 0; b < 1335; b++)
        c = c + u[1336 * a];
      u[1336 * a] *= 2.0;
    }
  *x = c;
}

int
main ()
{
  int d, e;
  for (d = 0; d < 1782225; d++)
    u[d] = 2.0;
  foo (&e);
  if (e != 3564450)
    __builtin_abort ();
  return 0;
}
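
For reference, the 3564450 that main checks for can be derived by hand: with
every u element equal to 2.0, each of the 1335 outer iterations adds
u[1336 * a] to c 1335 times before doubling it, and no element is read again
after it has been doubled.  A minimal sketch of that arithmetic (expected_sum
is a hypothetical helper, not part of the testcase):

```c
/* Hypothetical helper, not part of the testcase: value of c after foo
   when every element read by the loop nest starts out equal to init.
   Each of the n outer iterations adds init to c n times.  */
static double
expected_sum (int n, double init)
{
  return (double) n * (double) n * init;
}
```

expected_sum (1335, 2.0) evaluates to 3564450.0, matching the constant in main.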

This is outer loop vectorization, which processes two outer loop iterations at
the same time.  What I find wrong is what we emit at the place where we reduce
the result:
  <bb 12>:
  # c_42 = PHI <c_22(6)>
  # a_44 = PHI <a_17(6)>
  # ivtmp_47 = PHI <ivtmp_3(6)>
  # vect_c_10.8_64 = PHI <vect_c_10.8_63(6)>
  # pretmp_82 = PHI <pretmp_28(6)>
  vect_c_10.11_65 = VEC_PERM_EXPR <vect_c_10.8_64, { 0.0, 0.0 }, { 1, 2 }>;
  vect_c_10.11_66 = vect_c_10.11_65 + vect_c_10.8_64;
  stmp_c_10.10_67 = BIT_FIELD_REF <vect_c_10.11_66, 64, 0>;
  tmp.5_53 = pretmp_82 * 1.78089e+6;
  goto <bb 7>;
vect_c_10.8_63 holds here the partial sums of c from the first 2 * 667
iterations of the outer loop (low element from even and high element from odd
iterations).  So, I'd very much expect that stmp_c_10.10_67 would be what is
used in the PHI of the final scalar loop, but the stmp_c_10.10_67 SSA_NAME is
actually unused and instead tmp.5_53 = pretmp_82 * 1.78089e+6; computes
something weird (pretmp_82 at this point is u[1336 * (667 - 1)]).
It is true that some optimization for -Ofast could/should figure out that
      for (b = 0; b < 1335; b++)
        c = c + u[1336 * a];
is actually c = c + u[1336 * a] * 1335.0;, but that clearly hadn't happened by
the time of vectorization, and while in the testcase all u elements are
initially equal, that is not a given.  As a proof, I've used the debugger to
change the value of u[1336 * (667 - 1)] before the call to 6.0 and to 32.0, and
the final value of e has been roughly scaled accordingly (as the vectorized
loop doubles all u elements with indexes that are multiples of 1336, at that
point the value is twice the original value, and we then multiply it by
1.78089e+6 and add the sum from the trailing scalar outer loop).
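
To make the expectation above concrete, here is a hand-written scalar model of
what the vectorized code ought to compute.  It is only a sketch: scalar_sum,
two_lane_sum and their n/stride parameters are made up for illustration, and
this is not the code the vectorizer generates.  The two lanes play the role of
vect_c_10.8_63, and lane[0] + lane[1] corresponds to the value that
stmp_c_10.10_67 holds after the reduction.

```c
/* Scalar reference: the same loop nest as foo, on a caller-supplied
   array with n outer/inner iterations and the given stride.  */
static double
scalar_sum (double *u, int n, int stride)
{
  double c = 0.0;
  for (int a = 0; a < n; a++)
    {
      for (int b = 0; b < n; b++)
        c += u[stride * a];
      u[stride * a] *= 2.0;
    }
  return c;
}

/* Illustrative model of outer loop vectorization with two lanes:
   lane[0] accumulates the even outer iterations, lane[1] the odd ones.  */
static double
two_lane_sum (double *u, int n, int stride)
{
  double lane[2] = { 0.0, 0.0 };
  int a;
  for (a = 0; a + 1 < n; a += 2)
    {
      for (int b = 0; b < n; b++)
        {
          lane[0] += u[stride * a];
          lane[1] += u[stride * (a + 1)];
        }
      u[stride * a] *= 2.0;
      u[stride * (a + 1)] *= 2.0;
    }

  /* Epilogue reduction: the horizontal add of the two partial sums is
     what has to seed c in the trailing scalar loop.  */
  double c = lane[0] + lane[1];

  /* Trailing scalar outer loop for the leftover iteration, if any.  */
  for (; a < n; a++)
    {
      for (int b = 0; b < n; b++)
        c += u[stride * a];
      u[stride * a] *= 2.0;
    }
  return c;
}
```

With integer-valued, non-uniform contents (so that reassociation does not
change the result), both functions return the same sum and leave the array in
the same state.  As a side observation, 1335 * 1334 = 1780890, which agrees
with the 1.78089e+6 constant in the dump to the printed precision; the dump
itself does not confirm where that constant comes from.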
