[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

rguenth at gcc dot gnu.org Wed, 10 Apr 2019 02:44:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the important difference when comparing patched/unpatched is that the
unpatched compiler rejected vectorization with

mapz_module.fppized.f90:730:0: note: dependence distance == 0 between *a4_627(D)[_196] and *a4_627(D)[_196]
mapz_module.fppized.f90:730:0: note: READ_WRITE dependence in interleaving.
mapz_module.fppized.f90:730:0: note: bad data dependence.

while the patched compiler is happy.  That points to the patched function
and its call here:

static bool
vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
                                  loop_vec_info loop_vinfo,
                                  unsigned int *max_vf)
{ 
...
      if (dist == 0) 
        {       
...
          if (!vect_preserves_scalar_order_p (DR_STMT (dra), DR_STMT (drb)))
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "READ_WRITE dependence in interleaving.\n");
              return true;

It's probably a failure to factor in unrolling that breaks this case.

The unvectorized loop body looks like (all but relevant loads/stores elided):

  <bb 20> [local count: 118111594]:
  # i_313 = PHI <_1(19), i_293(24)>
  _146 = *a4_255(D)[_145];
  _152 = *a4_255(D)[_151];
  _165 = *a4_255(D)[_164];
  *a4_255(D)[_194] = _195;
  *a4_255(D)[_201] = _202;
  _203 = *a4_255(D)[_145];
  _290 = *a4_255(D)[_151];
  _291 = *a4_255(D)[_194];
  *a4_255(D)[_194] = M.42_316;
  i_293 = i_313 + 1;
  if (_2 < i_293)

final runtime alias checks are:

create runtime check for data references *a4_255(D)[_151] and *a4_255(D)[_201]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_164]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_145]
create runtime check for data references *a4_255(D)[_164] and *a4_255(D)[_201]

and groups are

note: Detected interleaving load *a4_255(D)[_151] and *a4_255(D)[_194]
note: Detected interleaving load of size 4 starting with _152 = *a4_255(D)[_151];
note: There is a gap of 2 elements after the group
note: Detected single element interleaving *a4_255(D)[_151] step 32
note: not consecutive access *a4_255(D)[_194] = _195;
note: using strided accesses
note: not consecutive access *a4_255(D)[_194] = M.42_316;
note: using strided accesses
note: Detected single element interleaving *a4_255(D)[_164] step 32
note: Detected single element interleaving *a4_255(D)[_145] step 32
note: Detected single element interleaving *a4_255(D)[_145] step 32
note: not consecutive access *a4_255(D)[_201] = _202;
note: using strided accesses

so there's no SLP involved.

The respective loop doesn't involve a reduction so -ffast-math shouldn't be
required here, only -ffinite-math-only for min/max recognition.

A C testcase mimicking the memory accesses and failing is

void __attribute__((noinline,noclone))
foo (double *a4, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a4[i*4] + a4[i*4+n];
      double tem2 = a4[i*4+2*n+1];
      a4[i*4+n+1] = tem1;
      a4[i*4+1] = tem2;
      double tem3 = a4[i*4] - a4[i*4+1];
      double tem4 = tem3 + a4[i*4+n];
      a4[i*4+n+1] = tem3 + a4[i*4+n+1];
    }
}

int main()
{
  const int n = 5;
  double a4[4 * n * 8];
  double a42[4 * n * 8];
  for (int i = 0; i < 4 * n * 8; ++i)
    a4[i] = a42[i] = i;
  foo (a4, n);
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a42[i*4] + a42[i*4+n];
      double tem2 = a42[i*4+2*n+1];
      a42[i*4+n+1] = tem1;
      a42[i*4+1] = tem2;
      double tem3 = a42[i*4] - a42[i*4+1];
      double tem4 = tem3 + a42[i*4+n];
      a42[i*4+n+1] = tem3 + a42[i*4+n+1];
      __asm__ volatile ("": : : "memory");
    }
  for (int i = 0; i < 4 * n * 8; ++i)
    if (a4[i] != a42[i])
      __builtin_abort ();
  return 0;
}
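
For illustration only, the kind of ordering hazard a distance == 0
READ_WRITE dependence guards against can be simulated by hand.  The sketch
below is not the code GCC generates and not a diagnosis of the actual bug;
it assumes that the loads of a[4*i] and a[4*i+1] form one interleaving group
and that the whole group is issued at the position of its first load.  The
function names scalar_order and hoisted_order are hypothetical.

```c
/* Scalar order: the reload of a[4*i+1] comes after the store to it,
   so it must observe the stored value.  */
double
scalar_order (double *a, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    {
      double x0 = a[4*i];        /* first member of the load group */
      a[4*i+1] = x0 + 10.0;      /* store between the two loads */
      double x1 = a[4*i+1];      /* second member, distance 0 to the store */
      sum += x0 - x1;
    }
  return sum;
}

/* Hypothetical reordering: both loads are issued where the first one was,
   so the second load no longer sees the store (stale value).  */
double
hoisted_order (double *a, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    {
      double x0 = a[4*i];
      double x1 = a[4*i+1];      /* moved above the store */
      a[4*i+1] = x0 + 10.0;
      sum += x0 - x1;
    }
  return sum;
}
```

With a[i] = i and n = 2 the two functions return different sums even though
they leave the array in the same state, which is why such a reordering has to
be rejected unless scalar order is known to be preserved.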
