https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #4 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> 
>   int i;
>   for (i = 0; i < 128; ++i)
>     {
>       a[i] = a[i] + 1;
>       b[i] = d[i];
>       c[i] = a[i] / d[i];
>     }
> 
> where the testcase expects b[i] = d[i] to be split out as memcpy but
> the other two partitions to be fused.
> 
> Generally the cost model lacks both a way to compute the number of
> input/output streams of a partition and a target interface to query limits
> on them.  Usually store bandwidth is not equal to load bandwidth, and store
> streams that are not re-used can benefit from non-temporal stores being
> used by libc.
> 
> In your testcase I wonder whether distributing to
> 
>     for (int j = 0; j < x; j++)
>       {
>         for (int i = 0; i < y; i++)
>           {
>             c[j][i] = b[j][i] - a[j][i];
>           }
>       }
>     memcpy (a, b, ...);
> 
> would be faster in the end (or even doing the memcpy first in this case).
> 
> Well, for now let's be more conservative given the cost model really is
> lacking.

The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
regression for Cortex-A72 after r242038, and I can fix that with
-fno-tree-loop-distribute-patterns.

In that benchmark there are 3 instances of the above pattern, so after loop
distribution you end up with 3 memcpy calls for:

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      U(I,J) = UNEW(I,J)
      V(I,J) = VNEW(I,J)
      P(I,J) = PNEW(I,J)
  300 CONTINUE

Three memcpy calls in place of three vector store instructions doesn't seem
like the right tradeoff to me. Sorry if I reduced the testcase too far to make
that balance clear.
