https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c, which looks like

  int i;
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      b[i] = d[i];
      c[i] = a[i] / d[i];
    }

where the testcase expects b[i] = d[i] to be split out as a memcpy but the
other two partitions to be fused.

Generally the cost model lacks a way to compute the number of input/output
streams of a partition, and a target interface to query limits on them.
Usually store bandwidth is not equal to load bandwidth, and store streams
that are not re-used can benefit from the non-temporal stores used by libc.

In your testcase I wonder whether distributing to

  for (int j = 0; j < x; j++)
    {
      for (int i = 0; i < y; i++)
        {
          c[j][i] = b[j][i] - a[j][i];
        }
    }
  memcpy (a, b, ...);

would be faster in the end (or even doing the memcpy first in this case).
Well, for now let's be more conservative given the cost model really is
lacking.