https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c, which looks like

  int i;
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      b[i] = d[i];
      c[i] = a[i] / d[i];
    }

where the testcase expects b[i] = d[i] to be split out as a memcpy but
the other two partitions to be fused.
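
For reference, a sketch of the distribution the testcase expects (the
element type, and thus the byte size below, is my assumption):

  /* Partition for b[i] = d[i], turned into a library call.  */
  memcpy (b, d, 128 * sizeof (int));  /* size assumes int elements */
  /* The two remaining partitions, fused into one loop.  */
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      c[i] = a[i] / d[i];
    }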

Generally, the cost model does not compute the number of input/output
streams of a partition, nor is there a target interface to query it about
limits on them.  Usually store bandwidth is not equal to load bandwidth,
and store streams that are not re-used can benefit from the non-temporal
stores libc uses.
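
To illustrate the non-temporal point, here is a minimal sketch of a store
stream written with x86 SSE2 intrinsics, roughly what libc can do
internally for large destinations that are not re-used (the function name
and alignment assumptions are mine, not from this bug):

  #include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */

  /* Fill dst with value, bypassing the cache.  Assumes dst is
     16-byte aligned and n is a multiple of 4.  */
  void
  fill_nontemporal (int *dst, int value, int n)
  {
    __m128i v = _mm_set1_epi32 (value);
    for (int i = 0; i < n; i += 4)
      _mm_stream_si128 ((__m128i *) (dst + i), v);
    _mm_sfence ();  /* order the streaming stores */
  }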

In your testcase I wonder whether distributing to

    for (int j = 0; j < x; j++)
      {
        for (int i = 0; i < y; i++)
          {
            c[j][i] = b[j][i] - a[j][i];
          }
      }
    memcpy (a, b, ...);

would be faster in the end (or even doing the memcpy first in this case).
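
Rough stream arithmetic (assuming the original loop fuses all four
accesses per iteration): fused, it drives four memory streams at once
(loads of a and b, stores to c and a), while after distribution the loop
touches three (loads of a and b, store to c) and the memcpy two (load of
b, store to a); that per-kernel stream count is exactly what a target
hook could bound.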

Well, for now let's be more conservative, given that the cost model really
is lacking.
